#learningstatistics — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #learningstatistics, aggregated by home.social.
-
So, to recap:
- sample means and standard deviations just happen to be optimal estimators of the parameters of a Gaussian distribution
- Gaussian distributions arise naturally (Central Limit Theorem), especially when several causes add up to one effect, so we can often fall back on them
- to construct a CI one has to build a probability statement around a quantity whose distribution does not depend on the very thing we're trying to estimate (otherwise: circular dependency!)
- it's easy when sigma is known (literally the CLT), but to get a quantity that depends on neither sigma nor mu we need a bit more elbow grease (Student t)
- when not Gaussian we need moar math -
"Suppose that Z has the standard normal distribution, V has the chi-squared distribution with n∈(0,∞) degrees of freedom, and that Z and V are independent. Random variable T=Z/√(V/n) has the student t distribution with n degrees of freedom."
This formula is very reminiscent of the one used to construct CIs for Gaussian samples with a known std. dev., just with the sample estimate of sigma instead of an a priori fixed sigma.
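A small simulation can make the quoted definition more concrete. For a Gaussian sample of size N, taking Z = (mean - mu)/(sigma/√N) and V = (N-1)s²/sigma² with n = N-1 gives T = (mean - mu)/(s/√N), where sigma has cancelled out. The sketch below draws T both ways and compares the tails with a plain Gaussian. The sample size, mu, sigma, seed and the 1.96 threshold are arbitrary choices for illustration, not values from the quote.

```python
import math
import random


def draw_t_from_definition(n, rng):
    """One draw of T = Z / sqrt(V / n), with Z ~ N(0, 1) and V ~ chi-squared(n)."""
    z = rng.gauss(0.0, 1.0)
    # A chi-squared variable with n degrees of freedom is Gamma(shape=n/2, scale=2).
    v = rng.gammavariate(n / 2.0, 2.0)
    return z / math.sqrt(v / n)


def draw_t_from_sample(size, mu, sigma, rng):
    """One draw of (mean - mu) / (s / sqrt(size)) from a Gaussian sample."""
    xs = [rng.gauss(mu, sigma) for _ in range(size)]
    mean = sum(xs) / size
    s2 = sum((x - mean) ** 2 for x in xs) / (size - 1)  # sample variance, 1/(N-1)
    return (mean - mu) / math.sqrt(s2 / size)


def tail_fraction(draws, threshold=1.96):
    """Fraction of draws whose absolute value exceeds the threshold."""
    return sum(abs(d) > threshold for d in draws) / len(draws)


rng = random.Random(0)   # fixed seed so the run is repeatable
trials = 20_000
size = 5                 # assumed sample size, so n = size - 1 = 4 degrees of freedom

by_definition = [draw_t_from_definition(size - 1, rng) for _ in range(trials)]
by_sample = [draw_t_from_sample(size, 10.0, 3.0, rng) for _ in range(trials)]
gaussian = [rng.gauss(0.0, 1.0) for _ in range(trials)]

print("P(|T| > 1.96), from the definition:", tail_fraction(by_definition))
print("P(|T| > 1.96), from Gaussian samples:", tail_fraction(by_sample))
print("P(|Z| > 1.96), plain Gaussian:", tail_fraction(gaussian))
```

The two T columns should roughly agree with each other and sit well above the Gaussian one, which is the heavier tail that the sample estimate of sigma brings in.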
-
Buried in the https://en.m.wikipedia.org/wiki/Student's_t-distribution page is this quote, which explains a lot:
"Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the need to use the Student's t distribution. These problems are generally of two kinds: (1) those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored because that is not the point that the author or instructor is then explaining."
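Case (1) in the quote can be checked with a rough simulation: build the "known sigma" interval, but plug in the sample std. dev. as if it were certain, and count how often the interval contains the true mean. This is only a sketch; the sample sizes, trial count and seed are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist


def z_interval_coverage(size, trials, rng, mu=0.0, sigma=1.0, level=0.95):
    """Fraction of intervals mean +/- z * s / sqrt(size) that contain mu.

    The interval uses the sample standard deviation s in place of sigma,
    i.e. it treats the data-based estimate as if it were certain.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # Gaussian quantile for the level
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(size)]
        m = sum(xs) / size
        s = math.sqrt(sum((x - m) ** 2 for x in xs) / (size - 1))
        if abs(m - mu) <= z * s / math.sqrt(size):
            hits += 1
    return hits / trials


rng = random.Random(1)  # fixed seed so the run is repeatable
coverage_small = z_interval_coverage(size=5, trials=4_000, rng=rng)
coverage_large = z_interval_coverage(size=200, trials=4_000, rng=rng)

print("nominal level: 0.95")
print("coverage with n = 5  :", coverage_small)
print("coverage with n = 200:", coverage_large)
```

With a small sample the interval should fall noticeably short of the nominal 95%, while with a large one the shortcut is nearly harmless, which matches the quote.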
-
A second, more involved realization: I wish people writing pages/articles/courses said upfront why statistics textbooks are so full of more complex distributions like Student t and chi-squared, instead of harping on for 20 pages about their properties.
I now understand that:
- the sample mean often follows a Gaussian distribution
- the sample variance often follows a (scaled) chi-squared distribution (I think this really needs a good visualization)
- when sigma is known a priori, CIs for the mean of samples from a Gaussian variable are built from a Gaussian distribution; when it is not, they are built from a Student t distribution (the t statistic cancels out both the mean and the std. dev.) -
The last page is also part of a bunch of wiki pages that are... surely technically correct but difficult to grasp intuitively.
Note the difference between population (1/N) and sample (1/(N-1)) stats. Used as an estimate of the population variance, the first has a lower mean squared error (at least for Gaussian data) but is biased, while the second has a higher MSE but is unbiased.
I spent some time trying to grasp that, and came to the conclusion that in practical terms it's not actionable for me yet: I either have large N, or my problem is small but more complex than a mean/var/std and I have no clue how to get an unbiased estimator for that. 🧵
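For what it's worth, the bias/MSE trade-off between the two normalizations can be seen numerically. The following is a minimal sketch assuming Gaussian data with a true variance of 1 and a sample size of 5; both are arbitrary choices.

```python
import random


def variance_estimates(xs):
    """Return (1/N estimate, 1/(N-1) estimate) of the variance of xs."""
    n = len(xs)
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)  # sum of squared deviations from the sample mean
    return ss / n, ss / (n - 1)


def bias_and_mse(estimates, truth):
    """Empirical bias and mean squared error of a list of estimates."""
    k = len(estimates)
    bias = sum(estimates) / k - truth
    mse = sum((e - truth) ** 2 for e in estimates) / k
    return bias, mse


rng = random.Random(2)  # fixed seed so the run is repeatable
true_variance = 1.0     # assumed: data are N(0, 1)
size, trials = 5, 50_000

pairs = [variance_estimates([rng.gauss(0.0, 1.0) for _ in range(size)])
         for _ in range(trials)]
bias_n, mse_n = bias_and_mse([p[0] for p in pairs], true_variance)
bias_n1, mse_n1 = bias_and_mse([p[1] for p in pairs], true_variance)

print(f"1/N     : bias = {bias_n:+.3f}, MSE = {mse_n:.3f}")
print(f"1/(N-1) : bias = {bias_n1:+.3f}, MSE = {mse_n1:.3f}")
```

The 1/N version should come out systematically too small but with a smaller MSE, and the 1/(N-1) version centered on the truth but with a larger MSE. With large N the two columns become nearly identical.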
-
Some of the answers in the last link do point out interesting results: sample mean and variance are optimal for a Gaussian distribution.
https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation adds that the midrange ((min+max)/2) would be optimal for a uniform distribution with unknown bounds? 🧵
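One way to probe the midrange claim is to compare the two estimators of the center on a uniform and on a Gaussian distribution. This is a sketch under assumed settings (sample size 20, unit-scale distributions, fixed seed), not a proof of optimality.

```python
import random


def midrange(xs):
    """Midpoint between the smallest and largest observation."""
    return (min(xs) + max(xs)) / 2.0


def sample_mean(xs):
    return sum(xs) / len(xs)


def mse(estimator, sampler, center, size, trials):
    """Empirical mean squared error of an estimator of the distribution's center."""
    total = 0.0
    for _ in range(trials):
        xs = [sampler() for _ in range(size)]
        total += (estimator(xs) - center) ** 2
    return total / trials


rng = random.Random(3)  # fixed seed so the run is repeatable
size, trials = 20, 20_000


def uniform_draw():
    return rng.uniform(0.0, 1.0)  # center is 0.5


def gaussian_draw():
    return rng.gauss(0.0, 1.0)    # center is 0.0


mse_uniform_mean = mse(sample_mean, uniform_draw, 0.5, size, trials)
mse_uniform_midrange = mse(midrange, uniform_draw, 0.5, size, trials)
mse_gauss_mean = mse(sample_mean, gaussian_draw, 0.0, size, trials)
mse_gauss_midrange = mse(midrange, gaussian_draw, 0.0, size, trials)

print(f"uniform : mean {mse_uniform_mean:.5f}  midrange {mse_uniform_midrange:.5f}")
print(f"gaussian: mean {mse_gauss_mean:.5f}  midrange {mse_gauss_midrange:.5f}")
```

The midrange should win clearly on the uniform data and lose clearly on the Gaussian data, so which estimator is "best" depends on the assumed distribution.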
-
An intuition I haven't yet verified: when we qualify samples using means and standard deviations, a hidden assumption is often made of a normal (Gaussian) distribution.
This might be what we want (the central limit theorem applies in a lot of cases, and is essentially "throw enough distributions together in a big bowl, mix them up and you end up with a normally distributed smoothie") but this is not always the case.
https://stats.stackexchange.com/questions/493548/when-we-calculate-mean-and-variance-do-we-assume-data-are-normally-distributed has more to say on this, but I'm not fully satisfied because it focuses on the purely theoretical math side of things, not on "what people actually interpret it as". 🧵
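The "smoothie" picture can be tried out directly: add up many non-Gaussian causes and check one Gaussian fingerprint, namely that about 68.3% of values fall within one standard deviation of the mean. The three cause distributions below are arbitrary picks for illustration, and the check is a rough sanity test, not a normality test.

```python
import math
import random


def make_causes(rng):
    """Three deliberately non-Gaussian 'causes' (arbitrary choices for illustration)."""
    return [
        lambda: rng.uniform(0.0, 1.0),      # flat
        lambda: rng.expovariate(1.0),       # skewed
        lambda: float(rng.random() < 0.3),  # 0/1 coin with p = 0.3
    ]


def effect(causes, count):
    """Sum `count` causes, cycling through the available distributions."""
    return sum(causes[i % len(causes)]() for i in range(count))


def fraction_within_one_sd(values):
    """Share of values within one sample standard deviation of the sample mean."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return sum(abs(v - m) <= sd for v in values) / n


rng = random.Random(4)  # fixed seed so the run is repeatable
causes = make_causes(rng)
trials = 10_000

single = fraction_within_one_sd([effect(causes, 1) for _ in range(trials)])
mixed = fraction_within_one_sd([effect(causes, 60) for _ in range(trials)])

print("Gaussian reference      : 0.683")
print("one uniform cause alone :", single)
print("sum of 60 mixed causes  :", mixed)
```

The single cause should be visibly off the Gaussian reference, while the sum of 60 should land close to it. With few causes, or one dominating cause, the match is not guaranteed, which is the "not always the case" part.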
-
What made my gears turn a little was: what if, instead of adding more data, you can only take subsets of your samples? For example, say you're trying to write a color picker tool for a photo. Different subsets of equal pixel count (the size of the picker tool) come out of a Poisson distribution (plus extra after processing)... and their mean will change even if the color is supposed to be uniform.
The mean is itself a random variable, which means we can do statistics on it. For example, we can compute its mean (whose difference from the actual mean is the bias) and its standard deviation (which is called the standard error of the mean). 🧵
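The color picker example can be sketched as a simulation: many picker-sized patches are drawn from the same uniform color, and the spread of their means is compared with sigma/√n. The Poisson level, the 5x5 picker size and the number of patches are hypothetical values, and post-processing noise is ignored.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(5)  # fixed seed so the run is repeatable
true_level = 20.0       # hypothetical mean count per pixel of a uniform color
picker_pixels = 25      # hypothetical 5x5 picker
patches = 2_000

patch_means = []
for _ in range(patches):
    pixels = [poisson(true_level, rng) for _ in range(picker_pixels)]
    patch_means.append(sum(pixels) / picker_pixels)

# Statistics on the mean itself, treated as a random variable.
mean_of_means = sum(patch_means) / patches
spread_of_means = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in patch_means) / (patches - 1)
)
# For a Poisson variable the variance equals lam, so sigma / sqrt(n) is:
predicted_standard_error = math.sqrt(true_level / picker_pixels)

print("mean of the patch means  :", mean_of_means, "(true level:", true_level, ")")
print("spread of the patch means:", spread_of_means)
print("sigma / sqrt(n)          :", predicted_standard_error)
```

The mean of the patch means should sit close to the true level (little or no bias), and their spread should be close to sigma/√n, which is the standard error.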
-
So, first, maybe means? I know, the things below might have been evident, I've been starting from a very low bar okay?
I hear about means and averages all the time.
One thing that surprised me a few years ago was that the mean of a sample of a random variable is itself a random variable.
This was not very obvious to me; my naive viewpoint was, I think, colored by the fact that if you're just looking at tests where you decide the sampling (e.g. do a poll on 100 people), well, you can just add 100 more and get better results, and the law of large numbers says you should get better as you add more, right? 🧵
-
One big motivation for this is that people often shove (pseudo-)statistical results under my nose. In some cases it "looks" like they did due diligence but more often than not the significance results look fuzzily fishy — and I can't argue why with confidence because I don't have enough statistical literacy.
A secondary motivation is that many fundamental results, papers, standards and recommendations on #ColourScience are based on statistics around psych and physical tests, whether for good or bad. But these results still elude me for the most part. 🧵
-
So, I'm going to start a thread on trying to better understand #statistics, in case anybody is interested. Boosts and/or clarifications welcome and appreciated!
It's been bugging me for a while that I don't seem to have a good intuitive grasp of statistics, and this is despite graduating from an engineering school — while I did get courses on probability theory and stuff like Markov chains or EM algorithms and whatnot these were engineering-focused. Case in point, I can't say I "get" confidence intervals. Neither do I understand statistical tests and the p-value outputs that are often presented as "obvious" in other fields. 🧵
-
The graphs and statistics in this UN report are fascinating. It might provide good exercises for a data or statistics class: redesign the pie charts, find the original figures and design tables... Just learn from it. It's a shame the text was garbled when I tried to copy directly from the .pdf.
https://www.unodc.org/documents/data-and-analysis/gsh/2023/Global_study_on_homicide_2023_web.pdf
#HomicideRates #MurderRate #StatisticsClass #LearningStatistics #UncertaintyGraphs #UncertaintyVisualization -
An #introduction post: I love #Statistics and #Learning, which means that I love both #StatisticalLearning and #LearningStatistics. Currently, most of my professional focus is on #teaching #ResearchMethods and Statistics to #undergraduate #psychmajors. I also love #photography, #travel, and #food (who doesn’t), and have recently figured out a way to combine all of these loves into a three-week #StudyAbroad trip to #Japan where I get to teach a class on #PsychologyOfLanguage.