Some basic statistics formulas for machine learning (data science) - Part 1

Question


asked Dec 16, 2020 in Artificial Intelligence(AI) & Machine Learning by Nisha Goeduhub's Expert (3.1k points)
edited Feb 13 by Nisha

In this article, we will discuss some statistics topics and concepts that are vital in data science. Statistics is a large subject in its own right, but machine learning and data science both depend on it. So, in this tutorial, we will look at some basic concepts that work behind the scenes in machine learning and data science.

Central tendency: mean, median, mode.

Dispersion: range, interquartile range (IQR), standard deviation, variance.

Other topics include correlation, frequencies, proportions, hypothesis testing, and inference. It will help if you have basic knowledge of probability and algebra (vectors).

1 Answer

answered Dec 16, 2020 by Nisha Goeduhub's Expert (3.1k points)
edited Dec 27, 2020 by Nisha

Best answer

In this article we will discuss some statistics formulas that are important from a data science point of view.

Note that some terms can be confusing because they have different meanings in different disciplines (data science, information technology, and statistics).

For example, in statistics an independent variable (or predictor variable) is used to predict a response (or dependent) variable. In data science, the same idea is expressed as features being used to predict a target.

Central Tendency:

Measures of central tendency estimate where most of the data (values) are located.

Mean: The mean is the sum of all values divided by the number of values.

For example, suppose we have the numbers (4, 7, 6, -13, 6, 8) and want to calculate their mean.

(4 + 7 + 6 + (-13) + 6 + 8)/6 = 18/6 = 3.

Mean formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Note: n refers to the total number of observations (values). In statistics, N is used in formulas for a population (parameter) and n in formulas for a sample. In data science this distinction in notation is usually not enforced, and either symbol may be used.
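As a minimal sketch of the mean calculation above (the variable names are illustrative and not part of the original article), the same example can be checked in Python by hand and with the standard library:

```python
from statistics import mean

# The example values from the text.
values = [4, 7, 6, -13, 6, 8]

# Sum of all values divided by the number of values.
manual_mean = sum(values) / len(values)

# The standard library's mean() should agree with the manual result.
library_mean = mean(values)

print(manual_mean, library_mean)
```

Both approaches give a mean of 3 for this data.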

Weighted mean: The weighted mean is another type of mean, in which each data value is multiplied by a weight; the sum of these products is then divided by the sum of the weights.

For example, suppose we want to calculate the weighted mean of the set of numbers (3, 4, 6, 7, 3, 6, 4, 4).

The weight of a value can differ from problem to problem, so you have to identify the weights from the problem at hand. Here, the weight of each distinct value is the number of times it appears: 3 appears twice, 4 three times, 6 twice, and 7 once.

Now let's solve the problem:

Weighted mean = (2(3) + 3(4) + 2(6) + 1(7))/8 = 37/8 = 4.625

Here 8 is the sum of the weights: 2 + 3 + 2 + 1 = 8.

Weighted mean formula: $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
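The weighted mean example can be sketched in Python as follows. The helper name weighted_mean is hypothetical, and the weights are assumed to be the frequencies of each distinct value, as in the worked example:

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Distinct values and how often each one appears in (3, 4, 6, 7, 3, 6, 4, 4).
distinct_values = [3, 4, 6, 7]
frequencies = [2, 3, 2, 1]

result = weighted_mean(distinct_values, frequencies)

# With frequencies as weights, the weighted mean equals the plain mean
# of the full list.
full_list = [3, 4, 6, 7, 3, 6, 4, 4]
plain_mean = sum(full_list) / len(full_list)

print(result, plain_mean)
```

When the weights are frequencies, the weighted mean and the ordinary mean of the expanded list coincide; other problems may call for different weights, such as importance scores.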

Median: The median is the middle number in a sorted set of numbers. If the data contains an odd number of values, the median is the middle value; if it contains an even number of values, the median is the average of the two middle values.

For example, in 1, 2, 3, 4, 5, 6, 7 the median is 4, and in 1, 2, 3, 4, 5, 6 the median is (3 + 4)/2 = 3.5.
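The odd and even cases described above can be sketched in Python like this. The function name manual_median is illustrative only; the standard library's statistics.median is used as a cross-check:

```python
from statistics import median


def manual_median(values):
    """Return the middle value, or the average of the two middle values."""
    if not values:
        raise ValueError("median of an empty sequence is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = manual_median([1, 2, 3, 4, 5, 6, 7])
even_result = manual_median([1, 2, 3, 4, 5, 6])

print(odd_result, even_result)
print(median([1, 2, 3, 4, 5, 6, 7]), median([1, 2, 3, 4, 5, 6]))
```

Sorting first matters: the input does not have to arrive in order for the result to be correct.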

Mode: The mode is the value that occurs most often in a set of numbers.

For example, in 1, 2, 3, 3, 3, 4, 4, 4, 4 the mode is 4.

As a real-life example, we can say that the modal religion in India is Hinduism. The mode is a simple summary statistic for categorical data and is generally not used for numeric data.
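A minimal sketch of finding the mode with Python's collections.Counter is shown below. The numeric list comes from the example above; the list of colour labels is made-up data, included only to illustrate the categorical case:

```python
from collections import Counter

# Numeric example from the text.
numbers = [1, 2, 3, 3, 3, 4, 4, 4, 4]
number_mode, number_count = Counter(numbers).most_common(1)[0]

# Hypothetical categorical data (not from the article).
colours = ["red", "blue", "red", "green", "red", "blue"]
colour_mode, colour_count = Counter(colours).most_common(1)[0]

print(number_mode, number_count)
print(colour_mode, colour_count)
```

If several values tie for the highest count, most_common(1) returns only one of them, so data with ties needs extra handling.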

Dispersion / Measures of variability:

Measures of the amount by which values are dispersed or scattered in a distribution.

Range: The range is the difference between the largest and smallest values in a set of numbers.

For example, for 23, 2, 3, 4, 5, 6, 7, 0 the range is 23 - 0 = 23.

As you can see, the range depends only on the maximum and minimum values of a set, so it is highly sensitive to outliers (extremely low or high values). For this reason it is not very useful as a general measure of dispersion.
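The sensitivity of the range to a single extreme value can be sketched in Python as follows. Treating 23 as the outlier is an assumption made for illustration:

```python
values = [23, 2, 3, 4, 5, 6, 7, 0]

# Range = largest value minus smallest value.
full_range = max(values) - min(values)

# Drop the assumed outlier (23) and recompute to see how much it mattered.
without_outlier = [v for v in values if v != 23]
reduced_range = max(without_outlier) - min(without_outlier)

print(full_range, reduced_range)
```

Removing one value changes the range from 23 to 7, which shows why the range alone can give a misleading picture of spread.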

Mean deviation: Deviation can be calculated about the mean, the median, or the mode, but the most useful and widely used is the deviation about the mean. Mean deviation is the average of the absolute deviations of each value from the mean of the dataset. It tells us how dispersed the data is around the central value.

Mean deviation formula: $MD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$

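Standard deviation and variance: The best-known measures of variability are the variance and the standard deviation, which are based on squared deviations. The variance is an average of the squared deviations, and the standard deviation is the square root of the variance.

Look at the formulas:

Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$, with standard deviation $\sigma = \sqrt{\sigma^2}$

Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, with standard deviation $s = \sqrt{s^2}$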

The value of standard deviation can never be negative.

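Question: Calculate the mean deviation (MD), standard deviation (SD), and variance for (2, 5, 9, 11, 13).

Answer: Mean = (2 + 5 + 9 + 11 + 13)/5 = 40/5 = 8

Mean deviation = (|2-8| + |5-8| + |9-8| + |11-8| + |13-8|)/5 = (6 + 3 + 1 + 3 + 5)/5 = 18/5 = 3.6

Variance = (36 + 9 + 1 + 9 + 25)/5 = 80/5 = 16 when dividing by N (population). Dividing by n-1 = 4 (sample) gives 80/4 = 20.

Standard deviation = √16 = 4 (population), or √20 ≈ 4.47 (sample).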
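The worked answer can be verified with a short Python sketch. The variable names are illustrative; the functions pvariance, variance, pstdev, and stdev from the standard statistics module are used only as a cross-check of the manual arithmetic:

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [2, 5, 9, 11, 13]
n = len(data)
mean_value = sum(data) / n

# Mean deviation: average of absolute deviations from the mean.
mean_deviation = sum(abs(x - mean_value) for x in data) / n

# Sum of squared deviations, shared by both variance formulas.
squared_sum = sum((x - mean_value) ** 2 for x in data)

population_variance = squared_sum / n        # divide by N
sample_variance = squared_sum / (n - 1)      # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(mean_value, mean_deviation)
print(population_variance, population_sd)
print(sample_variance, round(sample_sd, 2))

# Cross-check against the standard library.
print(pvariance(data), pstdev(data), variance(data), stdev(data))
```

The only difference between the population and sample results is the divisor, which is why the sample values are slightly larger.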

Question: Why is N replaced by n-1 in the sample variance / standard deviation?

Answer: For a detailed explanation, click here: Concept of replacing N with n-1 in sample variance / standard deviation.

Interquartile Range (IQR): The range of the middle 50 percent of a set of numbers. In simple terms, it is a common measure of variability, calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Unlike the range, which is highly sensitive to outliers (extremely low or high values), the IQR is much less affected by them.

Q1 is the value at position (n+1)/4 and Q3 is the value at position 3(n+1)/4 in the sorted set of numbers.

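Question: Calculate the IQR for (7, 9, 9, 10, 11, 11, 13).

Answer: Here n = 7. Q1 position = (7+1)/4 = 2 and Q3 position = 3(7+1)/4 = 6, so Q1 and Q3 are the values at the 2nd and 6th positions: Q1 = 9 and Q3 = 11.

IQR = Q3 - Q1 = 11 - 9 = 2.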
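The position rule used above can be sketched in Python as follows. The function name iqr_by_position is hypothetical, and this sketch assumes the positions (n+1)/4 and 3(n+1)/4 are whole numbers, as they are in the example. Other software may use different quartile conventions and give slightly different results on other data:

```python
from statistics import quantiles


def iqr_by_position(values):
    """Return (Q1, Q3, IQR) using positions (n+1)/4 and 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    q1_pos = (n + 1) / 4
    q3_pos = 3 * (n + 1) / 4
    # Assumption: only whole-number positions are handled in this sketch.
    if not (q1_pos.is_integer() and q3_pos.is_integer()):
        raise ValueError("positions are not whole numbers for this n")
    q1 = ordered[int(q1_pos) - 1]  # positions are 1-based
    q3 = ordered[int(q3_pos) - 1]
    return q1, q3, q3 - q1


data = [7, 9, 9, 10, 11, 11, 13]
q1, q3, iqr = iqr_by_position(data)
print(q1, q3, iqr)

# Cross-check: statistics.quantiles (Python 3.8+) with n=4 returns the
# three quartile cut points using its default "exclusive" method.
cut_points = quantiles(data, n=4)
print(cut_points)
```

For data where the positions are fractional, a common approach is to interpolate between the two neighbouring values, which is what statistics.quantiles does.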


Click here for part 2
