In this article we discuss some statistics formulas that are important from a data science point of view.
Note that some terms can be confusing because they carry different meanings in different disciplines (data science, information technology and statistics).
For example, in statistics an independent variable (or predictor variable) is used to predict a response (or dependent variable). In data science, on the other hand, features are used to predict a target.
Central Tendency:
A measure of central tendency estimates where most of the data (values) are located.
Mean: The mean is the sum of all values divided by the number of values.
For example, suppose we have the numbers (4, 7, 6, -13, 6, 8) and want to calculate their mean.
(4 + 7 + 6 + (-13) + 6 + 8) / 6 = 18/6 = 3.
Note: n refers to the total number of observations or values. In statistics, N is used in formulas for a population (parameter) and n in formulas for a sample. In data science this difference in notation is usually not important and either symbol may be used.
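The mean calculation above can be checked with a few lines of Python. This is only a minimal sketch; the variable names are chosen for illustration, and the statistics module from the Python standard library is used as a cross-check.

```python
from statistics import mean

values = [4, 7, 6, -13, 6, 8]

# Sum of all values divided by the number of values.
manual_mean = sum(values) / len(values)

# The standard library function should agree with the manual calculation.
library_mean = mean(values)

print(manual_mean, library_mean)
```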
Weighted Mean: The weighted mean is another type of mean, in which each data value is multiplied by a weight and the sum of these products is divided by the sum of the weights.
For example, we want to calculate the weighted mean of the set of numbers (3, 4, 6, 7, 3, 6, 4, 4).
The weight of a number can differ from problem to problem, so you have to identify the weight of each number. Here we take the weight of each distinct value to be the number of times it occurs: 3 occurs twice, 4 three times, 6 twice and 7 once.
Now let's solve the problem.
Weighted mean = (2(3) + 3(4) + 2(6) + 1(7)) / 8 = 37/8 = 4.625
Here 8 is the sum of the weights: 2 + 3 + 2 + 1 = 8.
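The same weighted mean can be sketched in Python. The example assumes, as in the worked problem, that the weight of each distinct value is simply how many times it occurs, so collections.Counter is used to derive the weights; in other problems the weights would come from elsewhere.

```python
from collections import Counter

numbers = [3, 4, 6, 7, 3, 6, 4, 4]

# Assumption: the weight of each distinct value is its frequency.
weights = Counter(numbers)  # maps value -> number of occurrences

# Sum of (weight * value) divided by the sum of the weights.
weighted_sum = sum(value * weight for value, weight in weights.items())
total_weight = sum(weights.values())
weighted_mean = weighted_sum / total_weight

print(weighted_sum, total_weight, weighted_mean)
```

Because the weights here are frequencies, the result equals the ordinary mean of the eight numbers.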
Median: The median is the middle number in a sorted set of numbers. If the data contain an odd number of values, the median is the middle value; if they contain an even number of values, the median is the average of the two middle values.
For example, for 1, 2, 3, 4, 5, 6, 7 the median is 4, and for 1, 2, 3, 4, 5, 6 the median is (3 + 4)/2 = 3.5.
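The odd and even cases can be written out as a small Python function. The name manual_median is hypothetical and used only for this sketch; statistics.median from the standard library follows the same rule and is shown for comparison.

```python
from statistics import median

def manual_median(data):
    """Return the middle value of the sorted data (illustrative helper)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: take the single middle value.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [1, 2, 3, 4, 5, 6, 7]
even_data = [1, 2, 3, 4, 5, 6]

print(manual_median(odd_data), median(odd_data))
print(manual_median(even_data), median(even_data))
```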
Mode: The mode is the value that occurs most often in a set of numbers.
For example, in 1, 2, 3, 3, 3, 4, 4, 4, 4 the mode is 4.
In everyday terms, we can say that the mode of religion in India is Hinduism. The mode is a simple summary statistic for categorical data and is generally not used for numeric data.
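As a quick illustration, the mode can be found by counting occurrences. The numeric list is the one from the example; the list of colours is made-up categorical data, included only to show that the same idea applies to categories.

```python
from collections import Counter
from statistics import mode

numbers = [1, 2, 3, 3, 3, 4, 4, 4, 4]

# Count how often each value occurs and pick the most frequent one.
counts = Counter(numbers)
manual_mode, manual_count = counts.most_common(1)[0]

# The standard library offers the same result directly.
library_mode = mode(numbers)

# Hypothetical categorical data: the mode is the most common category.
colours = ["red", "blue", "red", "green", "red", "blue"]
colour_mode = mode(colours)

print(manual_mode, manual_count, library_mode, colour_mode)
```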
Dispersion / Measures of Variability:
Measures of the amount by which values are dispersed or scattered in a distribution.
Range: The range is the difference between the largest and smallest values in a set of numbers.
For example, for 23, 2, 3, 4, 5, 6, 7, 0 the range is 23 - 0 = 23.
As you can see, the range depends only on the maximum and minimum values of the set, so it is highly sensitive to outliers (extremely low or high values) and is not very useful as a general measure of dispersion.
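The sensitivity of the range to a single extreme value can be seen in a short Python sketch. Treating 23 as the outlier and dropping it is an assumption made purely for illustration.

```python
data = [23, 2, 3, 4, 5, 6, 7, 0]

# Range: largest value minus smallest value.
full_range = max(data) - min(data)

# Illustration only: remove the single extreme value 23 and recompute.
without_outlier = [x for x in data if x != 23]
reduced_range = max(without_outlier) - min(without_outlier)

print(full_range, reduced_range)
```

One value out of eight changes the range from 23 to 7, which is why the range alone says little about how the bulk of the data is spread.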
Mean Deviation: A deviation can be calculated about the mean, the median or the mode, but the most useful and most widely used is the deviation about the mean. The mean deviation is the average of the absolute deviations of each value from the mean of the dataset. It tells us how dispersed the data are around the central value.
Standard deviation and variance: The best known measures for variability are the variance and the standard deviation, which are based on squared deviations. The variance is an average of the squared deviations, and the standard deviation is the square root of the variance.
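Look at the formulas:
Mean deviation: MD = Σ|x − mean| / n
Population variance: σ² = Σ(x − μ)² / N
Sample variance: s² = Σ(x − x̄)² / (n − 1)
Standard deviation: σ = √(σ²) for a population and s = √(s²) for a sample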
The value of standard deviation can never be negative.
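Question: Calculate the MD, SD and variance for (2, 5, 9, 11, 13).
Answer: mean = 40/5 = 8
Mean deviation = (|2-8| + |5-8| + |9-8| + |11-8| + |13-8|) / 5 = 18/5 = 3.6
Variance: the squared deviations sum to 36 + 9 + 1 + 9 + 25 = 80, so the sample variance is 80/(5-1) = 20 (the population variance is 80/5 = 16).
Standard deviation = square root of 20 ≈ 4.47 for the sample (square root of 16 = 4 for the population).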
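The quantities in this question can be verified with the Python standard library. This is a minimal sketch: the mean deviation is computed by hand because the statistics module has no function for it, while pvariance/pstdev divide by N (population) and variance/stdev divide by n - 1 (sample).

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 5, 9, 11, 13]
center = mean(data)

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - center) for x in data) / len(data)

# Population variance divides by N; sample variance divides by n - 1.
population_variance = pvariance(data)
sample_variance = variance(data)

# Standard deviation is the square root of the corresponding variance.
population_sd = pstdev(data)
sample_sd = stdev(data)

print(center, mean_deviation)
print(population_variance, sample_variance)
print(population_sd, round(sample_sd, 2))
```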
Question: Why is N replaced by n-1 in the sample variance / standard deviation?
Answer: For a detailed explanation, see the article "Concept of replacing N to n-1 in sample variance/ standard deviation".
Interquartile Range (IQR): The range of the middle 50 percent of a set of numbers. In simple terms, it is a common measure of variability, calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1). It is used to avoid the sensitivity to outliers (extremely low or high values) that affects the range.
Q1 is the value at position (n+1)/4 and Q3 is the value at position 3(n+1)/4 in the sorted set of numbers.
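Question: Calculate the IQR for (7, 9, 9, 10, 11, 11, 13).
Answer: n = 7, so the Q1 position is (7+1)/4 = 2 and the Q3 position is 3(7+1)/4 = 6. The values at the 2nd and 6th positions are 9 and 11.
IQR = Q3 - Q1 = 11 - 9 = 2.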
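The position rule can be sketched in Python as follows. The direct lookup below assumes both positions come out as whole numbers, which is the case for n = 7; for other sizes some interpolation between neighbouring values would be needed. As a cross-check, statistics.quantiles with its default "exclusive" method is, to my understanding, based on the same (n+1) positions.

```python
from statistics import quantiles

data = sorted([7, 9, 9, 10, 11, 11, 13])
n = len(data)

# 1-based positions from the (n + 1)/4 rule.
q1_position = (n + 1) / 4
q3_position = 3 * (n + 1) / 4

# Direct lookup: valid here only because both positions are whole numbers.
q1 = data[int(q1_position) - 1]
q3 = data[int(q3_position) - 1]
iqr = q3 - q1

# Cross-check with the standard library (default method="exclusive").
lib_q1, lib_median, lib_q3 = quantiles(data, n=4)
lib_iqr = lib_q3 - lib_q1

print(q1, q3, iqr, lib_iqr)
```

Different tools use slightly different quartile conventions, so results may differ from this rule on other datasets.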