Frequency:
In simple terms, frequency describes how often each value occurs in a data set (or in an experiment). To summarize the frequency of each value (or event), we use frequency tables (tabular form) and graphs such as histograms.
In statistics, the frequency of an event is the number of times it was observed or recorded in an experiment or study.
Relative frequency: The relative frequency of a value is its frequency divided by the total number of observations. The relative frequencies in a table sum to one (or very close to one when the values are rounded).
Cumulative relative frequency: To find the cumulative relative frequency of a row, add the relative frequencies of all previous rows to the relative frequency of the current row.
Here is an example
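As a small sketch of the three quantities defined above, the snippet below builds a frequency table from a made-up list of values (the data and variable names are assumptions for illustration, not from the original example).

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample: number of pets reported by 10 people.
data = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

counts = Counter(data)          # frequency of each distinct value
values = sorted(counts)
total = len(data)               # total number of observations

frequency = [counts[v] for v in values]
relative = [f / total for f in frequency]      # frequency / total
cumulative = list(accumulate(relative))        # running sum of relative frequencies

print("value  freq  rel.freq  cum.rel.freq")
for v, f, r, c in zip(values, frequency, relative, cumulative):
    print(f"{v:>5}  {f:>4}  {r:>8.2f}  {c:>12.2f}")
```

The last cumulative relative frequency should come out as one (up to floating-point rounding), which is a quick way to check the table.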
Histogram: A plot of the frequency table with the bins (ranges of values) on the x-axis and the count (or proportion) on the y-axis.
A density plot is a smoothed version of a histogram.
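To make the idea of bins concrete, here is a minimal text-only sketch that counts how many values fall into each bin, which is the information a histogram draws. The data, bin width, and starting edge are all assumptions chosen for illustration.

```python
# Hypothetical measurements (e.g. weights in kg); not from the original text.
weights = [48.5, 52.0, 55.3, 57.1, 58.8, 60.2, 61.0, 63.7, 66.4, 71.9]

bin_width = 5
start = 45      # left edge of the first bin (an assumption)
n_bins = 6      # bins cover 45 up to (not including) 75

counts = [0] * n_bins
for w in weights:
    index = int((w - start) // bin_width)   # which bin the value falls into
    counts[index] += 1

# Print one row per bin: the bin range, a bar of '#', and the count.
for i, c in enumerate(counts):
    left = start + i * bin_width
    print(f"[{left}, {left + bin_width})  {'#' * c}  ({c})")
```

A plotting library would draw the same counts as bars; a density plot would replace the bars with a smooth curve.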
Typical Shapes of plots:
Shape is an important characteristic of a smoothed histogram. The shape of the plot tells us what type of distribution the data follows.
Gaussian Distribution:
The Gaussian distribution (also known as the normal distribution) has a bell-shaped curve. An older name for the normal distribution was the error distribution.
Error: In data science and machine learning, an error is the difference between a predicted value and the observed data value.
Variable: A variable is a property or characteristic whose value can change.
Discrete variable: number of students present (a count). Continuous variable: weight of students (a measurement).
Random Variable: A random variable is a variable whose value is the numerical outcome of a random experiment.
In simple terms, selecting at random means every item in a sample has an equal chance of being selected. (With 5 oranges and 5 apples in a bucket, each individual fruit has a 1/10 chance (probability) of being chosen in a random draw.)
For example:
Let X represent the sum of two dice.
Then the probability distribution of X is:
X | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
P(X) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |
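The distribution of the sum of two dice can be checked by listing all 36 equally likely outcomes and counting how many give each sum. The following is a minimal sketch of that enumeration, assuming two fair six-sided dice (the variable names are illustrative).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))
sum_counts = Counter(a + b for a, b in outcomes)

# P(X = s) = (number of outcomes with sum s) / 36
distribution = {
    s: Fraction(n, len(outcomes)) for s, n in sorted(sum_counts.items())
}

for s in distribution:
    # Print the unreduced count over 36 so the rows are easy to compare.
    print(f"P(X = {s:>2}) = {sum_counts[s]}/36")

# A valid probability distribution sums to exactly 1.
print("total:", sum(distribution.values()))
```

Using `Fraction` keeps the probabilities exact, so the total is exactly 1 rather than a rounded decimal.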
Normal Distribution Curve:
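This is what the normal distribution curve looks like. Here mu (μ) is the mean and sigma (σ) is the standard deviation (the variance is σ²).
The mean of a normal (Gaussian) distribution sets where the curve is centered, and the standard deviation controls how tall and wide it is: a larger σ gives a wider, flatter curve.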
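As a rough illustration of how the two parameters affect the curve, the sketch below compares the height of the curve at its center for a few hypothetical normal distributions, using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The parameter values are assumptions picked for the example.

```python
from statistics import NormalDist

# Three hypothetical normal distributions: two with the same mean but
# different spread, and one with a shifted mean.
narrow = NormalDist(mu=0, sigma=1)
wide = NormalDist(mu=0, sigma=2)
shifted = NormalDist(mu=5, sigma=1)

# The peak of each curve sits at its mean.
print(narrow.pdf(0))    # height of the narrow curve at its center
print(wide.pdf(0))      # half as tall, because sigma is twice as large
print(shifted.pdf(5))   # same height as `narrow`, only moved to x = 5
```

Changing the mean slides the curve left or right without changing its shape, while changing the standard deviation stretches it and lowers the peak so that the total area stays equal to one.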
A random variable X that follows a Gaussian distribution with mean μ and standard deviation σ is written X ~ N(μ, σ²). Such a distribution obeys the empirical rule, also known as the three-sigma rule or the 68-95-99.7 rule.
The rule gives the percentage of values that lie within a band around the mean in a normal distribution.
That is, about 68% of the data fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.
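The three percentages can be reproduced from the cumulative distribution function of the normal distribution. This is a minimal sketch using `statistics.NormalDist` (Python 3.8 or later); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

Because every normal distribution has the same shape once measured in units of σ, the same three proportions hold for any mean and standard deviation.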
Standard Normal Distribution: The standard normal distribution (z distribution) is a normal distribution with a mean of 0 and a standard deviation of 1.
Any point (x) from a normal distribution can be converted to the standard normal distribution (z) by subtracting the mean and then dividing by the standard deviation. This is called standardization (sometimes loosely called normalization), and the result is called the z-score:
z = (x - μ) / σ
where x is a value of the random variable X, μ is the mean of X, and σ is the standard deviation of X.
If the mean is greater than the value x, the z-score is negative.
The positive or negative sign indicates whether the value is above or below the mean.
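The conversion to z-scores can be sketched in a few lines. The scores below are hypothetical, and the population standard deviation (`statistics.pstdev`) is used as an assumption; a sample standard deviation would give slightly different values.

```python
from statistics import mean, pstdev

# Hypothetical exam scores; not from the original text.
scores = [52, 60, 65, 70, 73, 80, 90]

mu = mean(scores)        # mean of the data
sigma = pstdev(scores)   # population standard deviation of the data

def z_score(x):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z_scores = [z_score(x) for x in scores]

for x, z in zip(scores, z_scores):
    position = "above" if z > 0 else "below" if z < 0 else "at"
    print(f"x = {x:>2}  z = {z:+.2f}  ({position} the mean)")
```

After standardization the z-scores themselves have a mean of 0 and a standard deviation of 1, which is what puts them on the standard normal scale.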
In simple terms, the standard normal distribution is a common scale: converting values from any normal distribution to z-scores shows how many standard deviations each value lies from the mean, which makes values from different data sets directly comparable.