Question 1: What is data?
Answer: Data is information, usually in numerical format, that has been collected or recorded.
Today's world is surrounded by data. Every time you tap your phone or use an application, you are producing data.
The internet itself is a huge store of data from different fields: music, sports, governments, research, food, people, habits, science and technology. Almost anything you want to search for can be found on the internet.
The data that we produce is processed through cleaning, inspecting and modeling to extract useful information, draw conclusions and support decision-making.
You can feel the ascendance of data in today's world. For example, in online markets, sellers use data to understand buyers' choices.
We now even have applications that keep your health records, such as your diet, sleeping patterns and heart rate. These applications are only possible because we have an enormous amount of data.
Question 2: What is statistics? Explain the difference between descriptive and inferential statistics.
Answer: Statistics refers to the mathematics and techniques used to collect, organize and analyze data and to draw conclusions from it.
There are two forms of statistics: descriptive and inferential.
Descriptive Statistics: Descriptive statistics can be defined as organizing and summarizing data using graphs, tables and numerical measures. The tools used in descriptive statistics include the mean, median, mode, variance and standard deviation.
Inferential Statistics: Making conclusions or predictions about a population using sample data drawn from it.
Let's try to understand it with an example:
Let's suppose the total population of a city ABC is 10,000. Using descriptive statistics we can represent various attributes of this population (gender, age, income, etc.), for example how many females and males there are, and we can also show these attributes on a graph.
In this scenario we can use the whole population (10,000).
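As a minimal sketch of the descriptive measures named above, the snippet below summarizes a tiny made-up population with Python's standard library. The ages and genders are hypothetical values chosen only for illustration; they are not data from city ABC.

```python
import statistics
from collections import Counter

# Hypothetical ages and genders for a tiny population (illustrative values only).
ages = [23, 35, 35, 41, 52, 29, 35, 60]
genders = ["F", "M", "F", "F", "M", "M", "F", "M"]

# Because we have every member of the population, we use the
# population variance and population standard deviation.
summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "variance": statistics.pvariance(ages),
    "std_dev": statistics.pstdev(ages),
}

# Frequency table for a categorical attribute.
gender_counts = Counter(genders)

print(summary)
print(gender_counts)
```

Nothing here is estimated: every number describes the data exactly as it is, which is what makes it descriptive rather than inferential.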
But now there is a question: how many people in city ABC like apples?
It is not practical to ask everyone (10,000) whether they like apples. In this case we can take a sample from the population, for example 100 people, and ask each of them.
Suppose 40 people out of 100 like apples. From this we can estimate that about 40% of the total population of city ABC likes apples.
Here we are making a conclusion, or inference, using sample data, so this falls under inferential statistics. With inferential statistics, you take data from samples and make generalizations about a population.
As you can see, inferential statistics is not always exact and depends on the sample data. We use a margin of error to express this uncertainty, written in a form such as 40% ± 3%, and we can take a larger sample for a more accurate result.
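To make the margin of error concrete, here is a rough sketch that computes it for a sample proportion. It assumes a simple random sample and the normal approximation with z = 1.96 (roughly 95% confidence); the function name is hypothetical, and the finite size of the population is ignored.

```python
import math

def proportion_margin_of_error(successes, n, z=1.96):
    """Return (sample proportion, approximate margin of error).

    Assumes a simple random sample and the normal approximation;
    z=1.96 corresponds to roughly 95% confidence.
    """
    p = successes / n
    return p, z * math.sqrt(p * (1 - p) / n)

# 40 out of 100 people like apples.
p, moe = proportion_margin_of_error(40, 100)
print(f"n=100:  {p:.0%} +/- {moe:.1%}")

# Same proportion, but a sample ten times larger.
p_large, moe_large = proportion_margin_of_error(400, 1000)
print(f"n=1000: {p_large:.0%} +/- {moe_large:.1%}")
```

Under these assumptions a sample of 100 gives a margin of nearly 10 percentage points, while a sample of about 1,000 brings it down to roughly 3 points, which shows why a larger sample gives a more accurate result.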
Question 3: What are the data types in machine learning / data science?
Answer: Understanding the types of data in data science (the statistics of data science) is important for analyzing data and making decisions properly.
Knowing the data type in data analysis and predictive modeling (machine learning) helps determine the type of visual (graph, chart, etc.) and the computational method appropriate for the data we have.
For example, the statistics computed for categorical and numeric data are different.
Types of data (in statistical terms, levels of measurement):
Categorical Data:
1. Nominal Category: Nominal data consists of discrete, unordered values. With this type of data, one option is chosen out of several. The data is based on labels, so order doesn't matter here. In simple terms, it is an unordered category.
For example: the gender of a person, which is either
A. Male
B. Female
In this case the order of male and female doesn't matter; a person chooses the one option that he/she belongs to.
One more example: Choose your first language.
A. Hindi
B. English
C. Tamil
D. Marathi
Binary Data: A special case of categorical data with only two possible values. The nominal values may be represented as numbers, but the numbers don't have quantitative meaning.
For example, in the above case we can define Male = 0 and Female = 1, but note that 1 and 0 have no quantitative meaning here; they are just labels.
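The encoding described above can be sketched as follows. The mapping and the list of responses are hypothetical, made up only to show that the codes behave as labels.

```python
# Hypothetical mapping; the 0/1 codes are labels, not quantities.
GENDER_CODES = {"Male": 0, "Female": 1}

responses = ["Female", "Male", "Female", "Female"]
encoded = [GENDER_CODES[r] for r in responses]

# Counting how often each label occurs is meaningful...
female_count = encoded.count(1)
male_count = encoded.count(0)

# ...but arithmetic on the codes is not: 1 > 0 is true for the integers,
# yet it says nothing about "Female" versus "Male".
print(encoded, female_count, male_count)
```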
2. Ordinal Category: Ordinal data is the same as nominal data, except that the order of the values matters. In simple terms, it is an ordered category.
For example: school grades A, B, C, D, E, F, etc. Here the order plays an important role.
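Because ordinal values carry an order, they can be sorted and a middle value can be picked, which is not possible for nominal labels. The sketch below assumes that A is the best grade and F the worst; the list of grades is made up for illustration.

```python
# Assumed ordering from best (A) to worst (F).
GRADE_ORDER = ["A", "B", "C", "D", "E", "F"]
RANK = {grade: position for position, grade in enumerate(GRADE_ORDER)}

# Hypothetical grades for seven students.
grades = ["C", "A", "F", "B", "B", "D", "A"]

# Sort by rank rather than by numeric value.
sorted_grades = sorted(grades, key=lambda g: RANK[g])

# With an odd number of values, the median is the middle element.
median_grade = sorted_grades[len(sorted_grades) // 2]

print(sorted_grades, median_grade)
```

The ranks are used only for ordering. The gap between A and B is not assumed to equal the gap between E and F.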
Numerical Data:
1. Discrete Data: This type of data can't be measured but can be counted, meaning it takes whole-number values.
For example, 5 cats or dogs; we can't say 4.5 cats or dogs.
2. Continuous Data: Data that can take any value in an interval. For example, the height or weight of a person can be described as a continuous real number.
This type of data can't be counted but can be measured, such as measurements of height, length, etc.
3. Ratio / Interval:
Interval Data: Interval data is a type of data where the difference between consecutive ordered values is the same. Interval values have no true zero and can also take values below zero. For example, temperature in degrees Celsius:
A. -10
B. -5
C. 0
D. 5
E. 10
Ratio Data: Ratio values are the same as interval values, except that they have a true zero and never fall below it. Height and weight are measured from 0 upward and never fall below it.
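One way to see the difference between the two scales is to compare Celsius (interval, no true zero) with Kelvin (ratio, true zero). This is a small illustrative sketch; the helper function is hypothetical, and the two temperatures are taken from the list above.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15

a_c, b_c = 5.0, 10.0

# Differences are meaningful on both scales and agree (up to floating point).
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios are only meaningful on the scale with a true zero.
ratio_c = b_c / a_c  # 2.0, but 10 degrees C is not "twice as hot" as 5 degrees C
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)  # close to 1.018

print(diff_c, diff_k, ratio_c, round(ratio_k, 3))
```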
Question 4: How do data types act as a deciding factor for graphs and statistics?
Answer: For nominal data we can use frequencies, percentages and proportions. Measures of central tendency such as the mean and median, and measures of dispersion, are not useful here because the categories are labels with no numeric value or order; the mode (the most frequent category) is the only measure of center that applies.
Nominal data defines categories, so we can visualize it using pie charts and bar graphs.
For ordinal data we can use the same concepts as for nominal data, and we can visualize it using bar graphs and pie charts. Additionally, because the values are ordered, we can calculate the median and IQR to summarize ordinal data; a mean is sometimes reported as well, but only after assigning numeric codes to the categories, so it should be read with care.
Continuous data can be summarized using the mean, median and IQR, and its spread can be evaluated using the range, standard deviation and variance.
The most appropriate graphs for visualizing continuous data are histograms and box plots.
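The rules above can be sketched as two small helpers, one per kind of data. The function names and sample values are hypothetical, and only the Python standard library is used (`statistics.quantiles` requires Python 3.8 or later).

```python
import statistics
from collections import Counter

def summarize_categorical(values):
    """Frequencies, proportions and mode: the summaries that suit nominal data."""
    counts = Counter(values)
    total = len(values)
    return {
        "counts": dict(counts),
        "proportions": {k: v / total for k, v in counts.items()},
        "mode": counts.most_common(1)[0][0],
    }

def summarize_continuous(values):
    """Center and spread: the summaries that suit continuous data."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical data: first languages (nominal) and heights in cm (continuous).
languages = ["Hindi", "English", "Hindi", "Tamil", "Hindi", "Marathi"]
heights_cm = [150.5, 162.0, 171.2, 158.4, 180.1, 166.3]

cat = summarize_categorical(languages)
num = summarize_continuous(heights_cm)
print(cat)
print(num)
```

The categorical summary feeds naturally into a bar graph or pie chart, while the continuous summary corresponds to what a histogram or box plot displays.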
Now you know how the data type plays an important role in deciding which graphs and statistics to use to summarize and generalize our data, so that we can make the best decisions and correct visualizations.
Here we only talked about simple visualizations. There are many more types of graphs and visualization methods, and we can choose the most appropriate one depending on the data we have.