Some basic statistics formulas for machine learning (data science) -part 2

Question

asked Dec 27, 2020 in Artificial Intelligence(AI) & Machine Learning by Nisha Goeduhub's Expert (3k points)
edited Feb 13 by Nisha

In this article, we will discuss some statistics topics and concepts that are vital in data science. Statistics is itself a large subject, but machine learning and data science both depend on it. So, in this tutorial, we will look at some basic concepts that work behind the scenes in machine learning and data science.

Central Tendencies - Mean, Median, Mode.

Dispersion - Range, Interquartile Range (IQR), Standard Deviation, Variance.

Correlation, Frequencies, Proportion, Hypothesis and Inference.

It will be helpful if you have basic knowledge of probability and (vector) algebra.

1 Answer

answered Dec 27, 2020 by Nisha Goeduhub's Expert (3k points)
edited Dec 29, 2020 by Nisha

Best answer

For Basic Statistics click here (Part 1)

Frequency:

In simple terms, frequency describes how often each value occurs in the data (or in an experiment). To present the frequency of a value (or event), we use frequency tables (tabular form) and graphs (histograms).

*In statistics, the frequency of an event is the number of times the observation occurred or was recorded in an experiment or study.

Relative frequency: The relative frequency of a value is its frequency divided by the total number of observations. The relative frequencies in a table sum to one (or very close to one when the values are rounded).

Cumulative relative frequency: To find the cumulative relative frequency, add all of the previous relative frequencies to the relative frequency for the current row.

Here is an example
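The three definitions above (frequency, relative frequency, cumulative relative frequency) can be sketched in a few lines of Python. This is only an illustrative sketch: the sample data and the helper name `frequency_table` are made up for this example and are not part of the original answer.

```python
from collections import Counter

# Hypothetical sample: number of siblings reported by 10 students.
data = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

def frequency_table(values):
    """Return rows of (value, frequency, relative freq, cumulative relative freq)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        relative = counts[value] / total   # frequency / total observations
        cumulative += relative             # running sum of relative frequencies
        rows.append((value, counts[value], relative, cumulative))
    return rows

table = frequency_table(data)
print("value\tfreq\trel\tcum")
for value, freq, rel, cum in table:
    print(f"{value}\t{freq}\t{rel:.2f}\t{cum:.2f}")
```

The last cumulative value should come out as 1 (up to floating-point rounding), which is a quick sanity check on any relative frequency table.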

Histogram: A plot of the frequency table with the bins (or values) on the x-axis and the count (or proportion) on the y-axis.
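As a rough illustration of what a histogram counts, the sketch below groups values into equal-width bins and prints a text bar for each bin. The weights, the bin width, and the helper name `histogram_counts` are assumptions chosen for this example; a plotting library would normally do the binning for you.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall into each half-open bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which bin the value belongs to
        left = start + index * bin_width       # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical student weights in kg.
weights = [48, 52, 55, 57, 61, 63, 64, 68, 71, 79]
bins = histogram_counts(weights, bin_width=10, start=40)

# Bins on the "x-axis", counts drawn as bars.
for left, count in bins.items():
    print(f"{left}-{left + 10}: {'#' * count}")
```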

A density plot is a smoothed version of a histogram.

[Figure: density plot]

Typical Shapes of plots:

Shape is an important characteristic of a smoothed histogram. Depending on the shape, we identify what type of distribution the data follows.

[Figure: typical shapes of plots]

Gaussian Distribution:

The Gaussian distribution (also known as the normal distribution) has a bell-shaped curve. Another name previously used for the normal distribution was the error distribution.

Error: In data science and machine learning, an error is the difference between a predicted value and the observed data value.

Variable: A variable is a property or characteristic whose value can change.

Discrete variable: number of students present (a count). Continuous variable: weight of students (a measurement).

Random Variable: A random variable is a variable whose value is the numerical outcome of a random experiment.

In simple terms, when we say a selection is random, we mean that every item in the sample has an equal chance of being selected. For example, with 5 oranges and 5 apples in a bucket, a random choice gives each individual fruit a 1/10 chance (probability) of being chosen.

For example:

Let X represent the sum of two dice.

Then the probability distribution of X is:

X	2	3	4	5	6	7	8	9	10	11	12
P(X)	1/36	2/36	3/36	4/36	5/36	6/36	5/36	4/36	3/36	2/36	1/36
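The table can be checked by listing all 36 equally likely outcomes of two fair dice and counting how many give each sum. The sketch below is one way to do that with the Python standard library; the variable names are chosen for this example only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
counts = Counter(sums)

# P(X = x) = (number of outcomes whose sum is x) / 36
distribution = {x: Fraction(counts[x], 36) for x in sorted(counts)}

for x in distribution:
    print(f"P(X={x}) = {counts[x]}/36")

# The probabilities of a distribution must add up to 1.
print("total:", sum(distribution.values()))
```

Note that `Fraction` reduces values automatically (for example 2/36 is stored as 1/18), which is why the loop prints the raw counts over 36 to match the table.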

Normal Distribution Curve:

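[Figure: Gaussian distribution curve]

This is what the normal distribution curve looks like. Here mu (μ) is the mean and sigma (σ) is the standard deviation; the variance is σ².

The mean of a normal/Gaussian distribution sets where the curve is centred, and the standard deviation controls how tall and wide it is: a larger σ gives a lower, wider curve.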
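To see how the two parameters affect the curve, the sketch below evaluates the density at the centre of three normal distributions using `statistics.NormalDist` from the Python standard library (Python 3.8+). The particular values of μ and σ are arbitrary choices for illustration.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=1)   # standard deviation 1
wide = NormalDist(mu=0, sigma=2)     # same centre, twice the spread
shifted = NormalDist(mu=5, sigma=1)  # same spread, centre moved to 5

# Height of each curve at its own mean (the peak of the bell).
print("narrow peak :", round(narrow.pdf(0), 4))
print("wide peak   :", round(wide.pdf(0), 4))
print("shifted peak:", round(shifted.pdf(5), 4))
```

Doubling σ halves the peak height, while changing μ only moves the peak along the x-axis without changing its height.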

A random variable X that follows a Gaussian distribution is written X ~ GD(μ, σ). Such data obey the empirical rule, also known as the three-sigma rule or the 68-95-99.7 rule.

The rule states the percentage of values of X that lie within a band around the mean of a normal distribution.

That is, about 68% of the data lie within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
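The 68-95-99.7 figures can be reproduced from the cumulative distribution function of a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is made up for this example.

```python
from statistics import NormalDist

def coverage(k, mu=0.0, sigma=1.0):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.1%}")
```

The result does not depend on the particular μ and σ, only on the number of standard deviations k, which is why the rule applies to every normal distribution.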

Standard Normal Distribution: The standard normal distribution (z distribution) is a normal distribution with a mean of 0 and a standard deviation of 1.

Any point (x) from a normal distribution can be converted to the standard normal distribution (z) by subtracting the mean and then dividing by the standard deviation. This is called standardization (sometimes loosely called normalization), and the resulting value is called the z-score.

z = (X - μ) / σ

where X is the random variable, μ is the mean of the X data, and σ is the standard deviation of the X data.

If the mean is greater than the data value (X), the z-score is negative.

The positive or negative sign indicates whether the value is above or below the mean.

In simple terms, the standard normal distribution is a common scale onto which any normal distribution can be translated, so that values can be compared and more can be learned about the data than was originally known.
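A minimal sketch of computing z-scores for a small data set is shown below. The exam scores and the helper name `z_scores` are hypothetical, and the sketch assumes the population standard deviation (`statistics.pstdev`); with a sample standard deviation (`statistics.stdev`) the values would differ slightly.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption of this sketch)
    return [(x - mu) / sigma for x in values]

# Hypothetical exam scores.
scores = [50, 60, 70, 80, 90]
z = z_scores(scores)

for x, zx in zip(scores, z):
    print(f"{x}: z = {zx:+.2f}")
```

Scores below the mean get negative z-scores and scores above it get positive ones; after standardization the data have mean 0 and standard deviation 1.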

What is an outlier? Click here to see the answer.
