Central Limit Theorem: Let's try to understand the central limit theorem with an example. Suppose we are given the task of calculating the average age of the population of India. You might think this is easy: note the age of every person in India, add the ages up, and divide by the size of the population.
But the population of India is so vast that it is not practical to record everyone's age; a survey on that scale would take a great deal of time, money, and people.
So how can we solve this problem with an appropriate method?
Second approach - To solve this problem, we can consider a group of people instead of the whole population. In statistics, such a group is known as a sample.
Steps to estimate the average age of the population with the second approach (Central Limit Theorem):
- Instead of considering the whole population, choose random samples (groups of people on whom the survey will be conducted) from the population.
- All samples should be independent, meaning they have no influence on each other.
- Calculate the mean of each sample.
- Calculate the mean of the sample means.
- The mean of the sample means is approximately equal to the population mean, and the approximation improves as more samples are taken.
- If we plot a histogram of the sample means, it resembles a bell-shaped curve, that is, a normal distribution (or an approximation of one). This is what the central limit theorem predicts.
- This is how we estimate the average (mean) of a very large dataset.
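The steps above can be sketched in Python. This is a minimal simulation, not a real survey: the population of ages is synthetic (a gamma distribution, chosen only so that the data are skewed), and the names `population`, `n_samples`, and `sample_size` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: 100,000 synthetic "ages" (skewed, not real census data)
population = rng.gamma(shape=2.0, scale=15.0, size=100_000)

n_samples = 500    # number of independent random samples (assumed value)
sample_size = 50   # people per sample (assumed value)

# Draw each sample at random and record its mean
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=False).mean()
    for _ in range(n_samples)
])

# The mean of the sample means estimates the population mean
estimate = sample_means.mean()
print(f"population mean:      {population.mean():.2f}")
print(f"mean of sample means: {estimate:.2f}")
```

The two printed values should be close, even though only a fraction of the population was ever surveyed.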
Here you may ask: how do we decide the sample size? (That is, how many people should there be in one group?)
Sample Size (n): The sample size should be sufficiently large. As a rule of thumb, n >= 30 (where n is the sample size) is considered sufficient when the population is symmetric (for example, when the population follows a normal distribution).
When the population is skewed or asymmetric, the sample size should be larger.
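To see why skewed populations call for larger samples, the sketch below measures how asymmetric the sample means still are for a few sample sizes. The exponential population, the sample sizes (5, 30, 200), and the helper `skewness` are assumptions made for illustration. Samples are drawn with replacement to keep the simulation fast.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def skewness(values):
    """Sample skewness: close to 0 for a symmetric distribution."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

# Hypothetical skewed population (exponential), chosen for illustration
population = rng.exponential(scale=1.0, size=100_000)

skew_by_n = {}
for n in (5, 30, 200):
    # 20,000 samples of size n, drawn with replacement
    samples = rng.choice(population, size=(20_000, n), replace=True)
    skew_by_n[n] = skewness(samples.mean(axis=1))
    print(f"n = {n:>3}: skewness of sample means = {skew_by_n[n]:.3f}")
```

The skewness of the sample means shrinks toward 0 as n grows, so the distribution of the means becomes more symmetric and closer to a normal curve.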
Definition of Central Limit Theorem:
The Central Limit Theorem states that for a given dataset (population) with an unknown distribution (normal, uniform, binomial, or any other distribution with a finite mean and variance), the distribution of the sample means will approximate a normal distribution when the sample size is sufficiently large.
For example, consider a random variable X with population mean μ and standard deviation σ. The distribution of X is unknown (it can be normal, uniform, binomial, etc.). Now let's draw samples from X and take the mean of each sample.
S1 = {x1, x2, x3, ..., x30}, with sample mean x̅1
S2 = {x1, x2, x3, ..., x30}, with sample mean x̅2
...
S50 = {x1, x2, x3, ..., x30}, with sample mean x̅50
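The mean of the sample means is approximately equal to the population mean (μ), and the sample means follow an approximately Gaussian (normal) distribution. This is what the CLT (Central Limit Theorem) states.
The mean and standard deviation of the sample means are given by µ X̄ = µ and σ X̄ = σ / √n, where: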
- µ X̄ = Mean of the sample means
- µ= Population mean
- σ X̄ = Standard deviation of the sample mean
- σ = Population standard deviation
- n = sample size
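The relationship between these quantities can be checked numerically. The sketch below is a simulation under assumed values: a uniform population, n = 30, and 10,000 samples drawn with replacement. It compares the mean and standard deviation of the sample means with µ and σ / √n.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population: uniform on [0, 10), assumed for illustration
population = rng.uniform(low=0.0, high=10.0, size=200_000)
mu = population.mean()     # population mean
sigma = population.std()   # population standard deviation

n = 30  # sample size
# 10,000 samples of size n, drawn with replacement
samples = rng.choice(population, size=(10_000, n), replace=True)
sample_means = samples.mean(axis=1)

print(f"mu = {mu:.3f}, mean of sample means = {sample_means.mean():.3f}")
print(f"sigma / sqrt(n) = {sigma / np.sqrt(n):.3f}, "
      f"std of sample means = {sample_means.std():.3f}")
```

Both pairs of values should agree closely. The standard deviation of the sample means is much smaller than σ, because averaging n values cancels out part of the variation.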
Significance of Central Limit Theorem:
- Many statistical procedures, such as hypothesis testing and confidence intervals, assume that the population data are normally distributed. In practice this assumption often does not hold, but with the help of the CLT (Central Limit Theorem) we can still apply these procedures, because the means of samples drawn from the population are approximately normally distributed even when the population itself is not.
- Used by pollsters to get an idea of election results.
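As an illustration of the confidence-interval use case, here is a minimal sketch of an approximate 95% confidence interval for a mean. The survey data are synthetic (an assumed exponential population with 400 respondents), and the multiplier 1.96 is the usual normal approximation that the CLT justifies for large samples.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical survey: 400 respondents from a skewed population (assumed data)
sample = rng.exponential(scale=40.0, size=400)

n = sample.size
mean = sample.mean()
# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = sample.std(ddof=1) / np.sqrt(n)

z = 1.96  # approximate 97.5th percentile of the standard normal distribution
lower, upper = mean - z * standard_error, mean + z * standard_error
print(f"sample mean = {mean:.2f}")
print(f"approximate 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is built around the sample mean only; the population itself is never assumed to be normal.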
Implementation of CLT in Python
This is a simple example to illustrate the effect of sample size and how to use the central limit theorem in code.
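import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# create a population of 10,000 random values from a standard normal distribution
x = np.random.normal(loc=0, scale=1, size=10000)
# distribution plot for the population
sns.histplot(x, kde=True)
plt.show()
When Sample Size is small (n < 30)
# distribution of a sample of population x (n = 20)
x1 = np.random.choice(x, size=20, replace=False)
sns.histplot(x1, kde=True)
plt.show()
When Sample Size is large (n >= 30)
# distribution of a sample of population x (n = 1000)
x2 = np.random.choice(x, size=1000, replace=False)
sns.histplot(x2, kde=True)
plt.show()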
Note: From the plots above, we can conclude that with a larger sample size the histogram looks closer to a normal curve. Here the population follows a standard normal distribution (mu = 0 and sigma = 1), but that is not the case for every population.