All about the sampling distribution of the sample mean
What is the sampling distribution of the sample mean?
We already know how to find parameters that describe a population, like mean, variance, and standard deviation. But we also know that finding these values for a population can be difficult or impossible, because it’s not usually easy to collect data for every single subject in a large population.
So, instead of collecting data for the entire population, we choose a subset of the population and call it a “sample.” We say that the larger population has ???N??? subjects, but the smaller sample has ???n??? subjects.
In the same way that we’d find parameters for the population, we can find statistics for the sample. Then, based on a sample statistic, we can infer that the corresponding population parameter is likely to be similar.
Sampling distribution of the sample mean
Keep in mind, though, that a single sample from a population could produce a statistic that isn’t a good estimate of the corresponding population parameter.
For example, maybe the mean height of girls in your class is ???65??? inches. Let’s say there are ???30??? girls in your class, and you take a sample of ???3??? of them. If you happened to pick the three tallest girls, then the mean of your sample will not be a good estimate of the mean of the population, because the mean height from your sample will be much higher than the mean height of the population. Similarly, if you instead just happened to choose the three shortest girls for your sample, your sample mean would be much lower than the actual population mean.
So how do we correct for this? Well, instead of taking just one sample from the population, we’ll take lots and lots of samples. In fact, if we want our sample size to be ???n=3??? girls, we could actually take a sample of every single combination of ???3??? girls in the class. We can find the total number of samples by calculating the combination
???_nC_k=\frac{n!}{k!(n-k)!}???
???_{30}C_{3}=\frac{30!}{3!(30-3)!}=\frac{30!}{3!27!}=\frac{30\cdot29\cdot28\cdot27\cdot26\cdot...}{3!(27\cdot26\cdot25\cdot24\cdot...)}=\frac{30\cdot29\cdot28}{3!}???
???_{30}C_{3}=\frac{30\cdot29\cdot28}{3\cdot2\cdot1}=\frac{10\cdot29\cdot28}{2\cdot1}=\frac{10\cdot29\cdot14}{1}=4,060???
In this example, if we used every possible sample (every possible combination of ???3??? girls), the number of samples (how many groups we use) is ???4,060??? and the sample size (how big each group is) is ???3??? girls.
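The count above can be double-checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and `math.comb` is the standard-library function for combinations.

```python
import math

# Number of ways to choose 3 girls from a class of 30, ignoring order.
class_size = 30
sample_size = 3
num_samples = math.comb(class_size, sample_size)

# The same value from the factorial formula n! / (k! (n - k)!).
by_formula = math.factorial(class_size) // (
    math.factorial(sample_size) * math.factorial(class_size - sample_size)
)

print(num_samples)  # 4060
print(by_formula)   # 4060
```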
Each sample contains three different girls, but once we’ve recorded a sample, we “put them back” into the population before picking another sample of three girls. We’ll keep doing this over and over again, until we’ve sampled every possible combination of three girls in our class.
Every one of these samples has a mean, and if we collect all of these means together, we can create a probability distribution that describes the distribution of these means. This distribution is approximately normal as long as the sample size is large enough (more on this later), and it’s called the sampling distribution of the sample mean.
When the sampling distribution of the sample mean is approximately normal, we can find its mean and standard deviation, and use them to answer probability questions about it.
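To make the idea of collecting every sample mean concrete, here is a small sketch that enumerates all samples of size ???3??? from a tiny population. The eight heights are made-up values chosen for illustration (they are not data from this lesson), and the variable names are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical heights in inches (made-up data, chosen to average 65).
heights = [60, 62, 63, 65, 66, 67, 68, 69]
n = 3  # sample size

# One mean per possible sample of 3 girls.
sample_means = [mean(sample) for sample in combinations(heights, n)]

population_mean = mean(heights)
mean_of_sample_means = mean(sample_means)

print(len(sample_means))                     # number of possible samples
print(min(sample_means), max(sample_means))  # most extreme sample means
print(population_mean, mean_of_sample_means)
```

Individual sample means range well above and below the population mean, but their average lands on the population mean (up to floating-point rounding).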
Central limit theorem
We just said that the sampling distribution of the sample mean is normal. More precisely, regardless of whether the population distribution is normal, the sampling distribution of the sample mean will be approximately normal as long as the sample size is large enough, which is profound! The central limit theorem is our justification for why this is true.
In reality, many distributions aren’t normal, meaning that they don’t approximate the bell-shaped curve of a normal distribution. Real-life distributions are all over the place because real-life phenomena don’t always follow a perfectly normal distribution.
The central limit theorem (CLT) gives us a way to work with a normal distribution even when the population isn’t normal. It tells us that, even if a population distribution is non-normal, its sampling distribution of the sample mean will be approximately normal when the sample size is large (a common rule of thumb is at least ???30???).
The central limit theorem is useful because it lets us apply what we know about normal distributions, like the properties of mean, variance, and standard deviation, to sample means drawn from non-normal populations.
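A quick simulation can illustrate the theorem. The sketch below assumes a right-skewed exponential population with mean ???1??? (an arbitrary choice for illustration) and compares sample means for a small and a larger sample size. The function name, seed, and number of repeats are all hypothetical choices.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so repeated runs give the same numbers

def simulate_sample_means(n, repeats=5000):
    """Draw `repeats` samples of size n from a right-skewed population
    (exponential with mean 1) and return the mean of each sample."""
    return [
        mean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

centers = {}
gaps = {}
for n in (2, 30):
    means = simulate_sample_means(n)
    centers[n] = mean(means)
    # In a symmetric, bell-shaped distribution the mean and median nearly
    # coincide, so a shrinking gap is a rough sign of fading skew.
    gaps[n] = mean(means) - median(means)
    print(n, round(centers[n], 3), round(gaps[n], 3))
```

With the larger sample size, the gap between the mean and median of the sample means shrinks, which is consistent with the distribution becoming more symmetric.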
Mean, variance, and standard deviation
The mean of the sampling distribution of the sample mean will always be the same as the mean of the original population, whether or not that population is normal. In other words, the mean of all the sample means is equal to the population mean.
???\mu_{\bar x}=\mu???
If the population is infinite and sampling is random, or if the population is finite but we’re sampling with replacement, then the variance of the sampling distribution is equal to the population variance divided by the sample size:
???\sigma_{\bar x}^2=\frac{\sigma^2}{n}???
where ???\sigma^2??? is the population variance and ???n??? is the sample size. The standard deviation of the sampling distribution, also called the standard error or the standard error of the mean, is therefore given by
???\sigma_{\bar x}=\frac{\sigma}{\sqrt{n}}???
where ???\sigma??? is the population standard deviation and ???n??? is the sample size.
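As a small sketch of the formula ???\sigma_{\bar x}=\sigma/\sqrt{n}???, the hypothetical helper below shows how the standard error shrinks as the sample size grows. The population standard deviation of ???12??? is an arbitrary value used only for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the sample mean,
    assuming an infinite population or sampling with replacement."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)

# Quadrupling the sample size halves the standard error.
for n in (4, 16, 64):
    print(n, standard_error(12, n))
```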
Finite population correction factor
If the size of the population ???N??? is finite, and if you’re sampling without replacement from more than ???5\%??? of the population, then you have to use what’s called the finite population correction factor (FPC).
Without the FPC, the formula ???\sigma/\sqrt{n}??? overstates the variability under those sampling conditions, so the standard error of the mean (or proportion) will be too big. Applying the FPC corrects the calculation by reducing the standard error to account for the fact that sampling without replacement from a finite population produces less variable sample means.
So under these sampling conditions, to find the variance of the sampling distribution we should instead use
???\sigma_{\bar x}^2=\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)???
And then the standard error would be
???\sigma_{\bar x}=\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}???
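The corrected formula can be checked against a brute-force enumeration. The sketch below reuses a made-up population of eight heights (hypothetical data), takes every possible sample of size ???3??? without replacement, and compares the observed variance of the sample means with the formula, with and without the FPC.

```python
from itertools import combinations
from statistics import mean, pvariance

# Hypothetical population (made-up heights in inches).
population = [60, 62, 63, 65, 66, 67, 68, 69]
N = len(population)
n = 3  # 3/8 of the population, well above the 5% threshold

sigma_squared = pvariance(population)  # population variance (divides by N)

# Variance of the sample means over every sample drawn without replacement.
sample_means = [mean(s) for s in combinations(population, n)]
observed = pvariance(sample_means)

uncorrected = sigma_squared / n
corrected = uncorrected * (N - n) / (N - 1)

print(observed, uncorrected, corrected)
```

In this enumeration the observed variance agrees with the corrected value, while the uncorrected value is noticeably larger.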
Conditions for inference
There are always three conditions that we want to pay attention to when we’re trying to use a sample to make an inference about a population.
Random sampling
Any sample we take needs to be a simple random sample. Often we’ll be told in the problem that sampling was random.
Normal condition, large counts
In general, we need to be sure that our sample size is large enough. In the case of the sampling distribution of the sample mean, ???30??? is the usual rule-of-thumb threshold for the sample size. In other words, each sample should include at least ???30??? subjects in order for us to rely on the CLT.
If the sample size is large (at least ???30???), then we typically consider that to be enough to get an approximately normal sampling distribution of the sample mean.
But when the sample size is smaller than ???30???, the samples may be too small to shift the distribution from non-normal to normal, so the sampling distribution will tend to keep some of the shape of the original distribution. So if the original distribution is right-skewed, the sampling distribution will tend to be right-skewed; and if the original distribution is left-skewed, then the sampling distribution will also tend to be left-skewed.
If the original distribution is normal, then this rule doesn’t apply because the sampling distribution will also be normal, regardless of the sample size, even if it’s smaller than ???30???.
Independence condition, ???10\%??? rule
If we’re sampling with replacement, then our observations are independent. But if we’re sampling without replacement (we’re not “putting our subjects back” into the population as we draw them), then the ???10\%??? rule tells us that we need to keep the number of subjects in our sample below ???10\%??? of the total population in order to assume independence.
For example, if the original population is ???2,000??? subjects, we need to make sure that each sample we take to create the sampling distribution of the sample mean is fewer than ???200??? subjects. We can still take as many samples as we want to (the more, the better), but each sample needs to include fewer than ???200??? subjects so that we stay under the ???200/2,000=1/10=10\%??? threshold.
In other words, as long as we keep each sample at less than ???10\%??? of the total population, we can “get away with” a sample that isn’t truly independent (without replacement), because staying under this ???10\%??? threshold approximates independence.
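The ???10\%??? check is simple enough to express as a small helper. This is a sketch only: the function name is hypothetical, and treating the threshold as a strict inequality (less than ???10\%???) is an assumption based on the wording above.

```python
def meets_ten_percent_condition(sample_size, population_size):
    """Return True when the sample is less than 10% of the population.

    Integer arithmetic avoids floating-point surprises at the boundary.
    The strict inequality is an assumption about how the rule is applied.
    """
    return 10 * sample_size < population_size

print(meets_ten_percent_condition(199, 2000))  # True
print(meets_ten_percent_condition(200, 2000))  # False
print(meets_ten_percent_condition(500, 2000))  # False
```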
Probability that the sample mean is within an interval around the population mean
Example
A company produces soccer balls in a factory. Individual soccer balls are filled to a mean pressure of ???8.7??? PSI (pounds per square inch), with a standard deviation of ???0.4??? PSI. The pressure in the soccer balls is normally distributed. The company randomly selects ???25??? soccer balls to check their pressure. What is the probability that the mean pressure in these balls ???\bar x??? is within ???0.2??? PSI of the population mean?
Before we can try to answer this probability question, we need to check for normality. Our sample of ???25??? soccer balls doesn’t meet the sample-size threshold of ???30???. However, because the population is normally distributed, the sampling distribution of the sample mean will be normal as well, even with a sample size smaller than ???30???.
If the population were a non-normal distribution (skewed to the right or left, or non-normal in some other way), the CLT would tell us that we’d need a sample size of at least ???30??? in order for the sampling distribution of the sample mean to be approximately normal.
It’s reasonable to assume independence, since ???25??? soccer balls is certainly less than ???10\%??? of the soccer balls being produced in this soccer ball factory.
And we were told in the problem that the ???25??? soccer balls were randomly selected. Therefore, with an independent, random sample from a normal population, we know the sampling distribution of the sample mean will also be normal, and we can move forward with answering the probability question.
The mean of the sampling distribution of the sample mean will be equal to the population mean, so ???\mu_{\bar x}=8.7???. The standard deviation of the sampling distribution of the sample mean will be
???\sigma_{\bar x}=\frac{\sigma}{\sqrt{n}}???
???\sigma_{\bar x}=\frac{0.4}{\sqrt{25}}???
???\sigma_{\bar x}=\frac{0.4}{5}???
???\sigma_{\bar x}=0.08???
We want to know the probability that the sample mean ???\bar x??? is within ???0.2??? PSI of the population mean. We need to express ???0.2??? in terms of standard deviations.
???\frac{0.2}{0.08}=2.5???
Which means we want to know the probability of ???P(-2.5<z<2.5)???.
In a ???z???-table, a ???z???-value of ???2.5??? gives ???0.9938???,
and a value of ???-2.5??? gives ???0.0062???.
Which means the probability under the normal curve between these ???z???-scores is
???P(-2.5<z<2.5)=0.9938-0.0062???
???P(-2.5<z<2.5)=0.9876???
???P(-2.5<z<2.5)\approx99\%???
Which means there’s an approximately ???99\%??? chance that our sample mean will fall within ???0.2??? PSI of the population mean of ???8.7??? PSI.
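The same answer can be reproduced without a ???z???-table. The sketch below uses `NormalDist` from Python’s standard `statistics` module to model the sampling distribution directly. The variable names are illustrative.

```python
import math
from statistics import NormalDist

mu = 8.7      # population mean pressure (PSI)
sigma = 0.4   # population standard deviation (PSI)
n = 25        # sample size
margin = 0.2  # how close the sample mean must be to mu (PSI)

# Standard error of the mean: sigma / sqrt(n)
standard_error = sigma / math.sqrt(n)

# Sampling distribution of the sample mean, assumed normal because the
# population itself is normal.
sampling_dist = NormalDist(mu=mu, sigma=standard_error)

probability = sampling_dist.cdf(mu + margin) - sampling_dist.cdf(mu - margin)
print(round(probability, 4))  # 0.9876
```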