#learningstatistics — Public Fediverse posts on home.social

rfnix @[email protected] · 2025-08-17 · 18:28 UTC

So, to recap:

- sample means and standard deviations just happen to be optimal estimators of the parameters of a Gaussian distribution
- Gaussian distributions happen naturally (Central Limit Theorem), especially when several causes combine into one effect, so we can often fall back on them
- to construct a CI one has to build a probability statement around a quantity whose distribution does not depend on the very thing we're trying to estimate (otherwise circular dep!)
- it's easy when sigma is known (literally the CLT), but to get a quantity that depends on neither sigma nor mu we need a bit more elbow grease (Student t)
- when not Gaussian we need moar math
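
A quick way to see why the Student t step matters, as a minimal sketch and not anything authoritative: simulate many small Gaussian samples and count how often a z-based 95% interval contains the true mean, once with the true sigma and once with the sample estimate plugged in. The function name `coverage` and the values of `mu`, `sigma` and `n` are made up for the illustration.

```python
import random
import statistics
from statistics import NormalDist

def coverage(n, trials, use_known_sigma, mu=10.0, sigma=2.0, level=0.95, seed=0):
    """Fraction of trials whose z-based interval contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        # Either plug in the true sigma, or the sample estimate s.
        spread = sigma if use_known_sigma else statistics.stdev(sample)
        half_width = z * spread / n ** 0.5
        if abs(mean - mu) <= half_width:
            hits += 1
    return hits / trials

known = coverage(n=5, trials=20000, use_known_sigma=True)
estimated = coverage(n=5, trials=20000, use_known_sigma=False)
print(f"sigma known:     {known:.3f}")
print(f"sigma estimated: {estimated:.3f}")
```

With only 5 samples the second number should land clearly below 0.95: the Gaussian quantile is too narrow once sigma is itself estimated. Swapping 1.96 for the Student t quantile with n-1 = 4 degrees of freedom (2.776) is what brings the coverage back.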

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-17 · 18:17 UTC

https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/05%3A_Special_Distributions/5.10%3A_The_Student_t_Distribution

"Suppose that Z has the standard normal distribution, V has the chi-squared distribution with n∈(0,∞) degrees of freedom, and that Z and V are independent. Random variable T=Z/√(V/n) has the student t distribution with n degrees of freedom."

This formula is very reminiscent of the one used to construct CIs for Gaussian samples with known std. dev., just with the sample estimate of sigma instead of an a priori fixed sigma.
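
Spelling the connection out, as a sketch under my own notation (not from the quoted page): assume X_1, ..., X_n are independent draws from a Gaussian with mean μ and std. dev. σ, with sample mean X̄ and sample std. dev. S computed with the 1/(n-1) formula. For Gaussian samples X̄ and S are independent, so the definition applies with n-1 degrees of freedom:

```latex
\[
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1),
\qquad
V = \frac{(n-1)\, S^2}{\sigma^2} \sim \chi^2_{n-1},
\]
\[
T = \frac{Z}{\sqrt{V / (n-1)}}
  = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \cdot \frac{\sigma}{S}
  = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}.
\]
```

The unknown σ cancels between Z and V, which is why T can be used for a CI on μ without knowing σ.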

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-17 · 18:17 UTC

Buried in https://en.m.wikipedia.org/wiki/Student's_t-distribution is this quote, which explains a lot:

"Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the need to use the Student's t distribution. These problems are generally of two kinds: (1) those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored because that is not the point that the author or instructor is then explaining."

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-17 · 18:17 UTC

A second, more involved realization: I wish people writing pages/articles/courses said upfront why statistics textbooks are so full of more complex distributions like Student t and chi-squared, instead of harping on about their properties for 20 pages.

I now understand that:
- the sample mean often follows a Gaussian distribution
- the sample variance, suitably scaled, follows a chi-squared distribution when the data are Gaussian (I think this really needs a good visualization)
- when sigma is known a priori, CIs for the mean of samples from a Gaussian variable are built from a Gaussian distribution; when it is not, they are built from a Student t distribution (the t statistic cancels both the unknown mean and the unknown std. dev.)
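
In lieu of a proper visualization, the chi-squared point can at least be checked numerically. This is a minimal sketch with hypothetical names and parameters (`scaled_variances`, `sigma=3.0`, `n=6`): for Gaussian samples of size n, the quantity (n-1)·s²/sigma² should behave like a chi-squared variable with n-1 degrees of freedom, whose mean is n-1 and whose variance is 2(n-1).

```python
import random
import statistics

def scaled_variances(n, trials, sigma=3.0, seed=1):
    """Draw (n - 1) * s^2 / sigma^2 for many Gaussian samples of size n."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # sample variance, 1/(n-1) formula
        out.append((n - 1) * s2 / sigma ** 2)
    return out

n = 6
values = scaled_variances(n, trials=20000)
# A chi-squared variable with k degrees of freedom has mean k and variance 2k.
print("mean:    ", round(statistics.fmean(values), 2), "expected", n - 1)
print("variance:", round(statistics.variance(values), 2), "expected", 2 * (n - 1))
```

Feeding `values` to any histogram tool would give the skewed chi-squared shape directly.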

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-10 · 17:42 UTC

The last page is also part of a bunch of wiki pages that are... surely technically correct but difficult to grasp intuitively.

Note the difference between the population (1/N) and sample (1/(N-1)) variance formulas. The first has better mean squared error but is biased with respect to the population, and the second has worse MSE but is unbiased with respect to the population.
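
The trade-off is easy to see in a simulation. A minimal sketch, assuming Gaussian data with true variance 1 and a small sample size of 5 (the function name and parameters are made up for the illustration):

```python
import random
import statistics

def compare_variance_estimators(n=5, trials=20000, sigma=1.0, seed=2):
    """Return (mean, MSE) of the 1/N and of the 1/(N-1) variance estimates."""
    rng = random.Random(seed)
    true_var = sigma ** 2
    pop, samp = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        pop.append(statistics.pvariance(sample))  # divides by N
        samp.append(statistics.variance(sample))  # divides by N - 1

    def summarize(estimates):
        mean = statistics.fmean(estimates)
        mse = statistics.fmean([(e - true_var) ** 2 for e in estimates])
        return mean, mse

    return summarize(pop), summarize(samp)

(pop_mean, pop_mse), (samp_mean, samp_mse) = compare_variance_estimators()
print(f"1/N:     mean {pop_mean:.3f}, MSE {pop_mse:.3f}")
print(f"1/(N-1): mean {samp_mean:.3f}, MSE {samp_mse:.3f}")
```

For Gaussian data the 1/N estimate averages (N-1)/N times the true variance (0.8 here), while the 1/(N-1) estimate averages the true variance but scatters more around it.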

I spent some time trying to grasp that, and came to the conclusion that in practical terms it's not actionable for me yet: I either have large N, or my problem is small but more complex than a mean/var/std and I have no clue how to get an unbiased estimator for that. 🧵

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-10 · 17:26 UTC

Some of the answers in the last link do point out interesting results: sample mean and variance are optimal for a Gaussian distribution.

https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation adds that the midrange ((min+max)/2) would be optimal for unknown bounded distributions? 🧵

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-10 · 17:05 UTC

An intuition I haven't yet verified: when we describe samples using means and standard deviations, a hidden assumption of a normal (Gaussian) distribution is often made.

This might be what we want (the central limit theorem applies in a lot of cases, and is essentially "throw enough distributions together in a big bowl, mix them up and you end up with a normally distributed smoothie") but this is not always the case.
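
The smoothie effect can be watched happening. A minimal sketch, assuming the simplest CLT setting (averaging k independent draws from the same distribution) and picking the exponential distribution as a deliberately lopsided ingredient; both helper names are hypothetical. Skewness is 0 for a symmetric shape like a Gaussian and 2 for a raw exponential.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: 0 for a symmetric shape such as a Gaussian."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def means_of_exponentials(k, trials=20000, seed=3):
    """Each value is the mean of k draws from a (very skewed) exponential."""
    rng = random.Random(seed)
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(k)])
            for _ in range(trials)]

for k in (1, 5, 30):
    print(k, round(skewness(means_of_exponentials(k)), 2))
```

The skewness should shrink roughly like 2/sqrt(k) as more draws are mixed in, which also shows the "not always the case" part: with few ingredients the result is still visibly non-Gaussian.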

https://stats.stackexchange.com/questions/493548/when-we-calculate-mean-and-variance-do-we-assume-data-are-normally-distributed has more to say on this, but I'm not fully satisfied because it focuses on the purely theoretical math side of things, not on how people actually interpret it. 🧵

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-10 · 16:26 UTC

What made my gears turn a little was: what if, instead of adding more data, you can only take subsets of your samples? For example, say you're trying to write a color picker tool for a photo. Different subsets of equal pixel count (the size of the picker tool) come out of a Poisson distribution (plus extra after processing)... and their mean will change even if the color is supposed to be uniform.

The mean is itself a random variable, which means we can do statistics on it. For example, we can compute its mean (whose difference from the actual mean is the bias) and its standard deviation (which is called the standard error of the estimator of the mean). 🧵
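
The color picker case can be sketched in a few lines. Everything here is an assumption for illustration: a uniform patch of brightness 50, pure Poisson noise per pixel (no processing), a 25-pixel picker, and a hand-rolled Poisson sampler because Python's `random` module does not provide one.

```python
import math
import random
import statistics

def poisson(rng, lam):
    """Knuth's multiplication method; adequate for moderate lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def patch_means(n_patches=2000, patch_size=25, brightness=50.0, seed=4):
    """Mean pixel value of many equal-size patches of a 'uniform' color."""
    rng = random.Random(seed)
    return [statistics.fmean([poisson(rng, brightness) for _ in range(patch_size)])
            for _ in range(n_patches)]

means = patch_means()
observed_se = statistics.stdev(means)            # spread of the patch means
predicted_se = math.sqrt(50.0) / math.sqrt(25)   # sigma / sqrt(patch size)
print(f"mean of the means: {statistics.fmean(means):.2f}")
print(f"standard error: observed {observed_se:.2f}, predicted {predicted_se:.2f}")
```

The patch means wobble around 50 even though the underlying color never changes, and their spread should match sigma/sqrt(patch size), with sigma = sqrt(50) for a Poisson variable of mean 50.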

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-10 · 16:17 UTC

So, first, maybe means? I know, the things below might be evident, but I've been starting from a very low bar, okay?

I hear about means and averages all the time.

One thing that surprised me a few years ago was that the mean of a sample from a random variable is itself a random variable.

This was not very obvious to me; my naive viewpoint was, I think, colored by the fact that if you're just looking at tests where you decide the sampling (e.g. do a poll of 100 people), well, you can just add 100 more and get better results, and the law of large numbers says you should get better as you add more, right? 🧵

#statistics #LearningStatistics

rfnix @[email protected] · 2025-08-10 · 15:50 UTC

One big motivation for this is that people often shove (pseudo-)statistical results under my nose. In some cases it "looks" like they did due diligence but more often than not the significance results look fuzzily fishy — and I can't argue why with confidence because I don't have enough statistical literacy.