Exploratory Data Analysis on Haberman Dataset
Exploratory Data Analysis (EDA) is the process of exploring a dataset and deriving meaningful conclusions from it. It covers tasks such as finding patterns, testing hypotheses, reducing dimensionality and handling missing values.
                        HABERMAN DATASET
This dataset records the survival of patients who underwent surgery for breast cancer. The case study was conducted between 1958 and 1970. It has 4 attributes, as follows.
                        
                        Columns
1. age - age of the patient at the time of the operation
2. year - year of the operation (last two digits, e.g. 64 for 1964)
3. nodes - number of positive axillary nodes detected
4. status - survival status (1 or 2)
           1 - survived 5 years or longer
           2 - died within 5 years
                        
                        
                        
                        1. DATA EXPLORATION
                        Data exploration is the first step towards data analysis. We visualize the data and try to understand it.
                        
                          
                          
                            
# Importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Reading the data
data = pd.read_csv('haberman.csv')
# Printing the first 5 rows of the data
print(data.head())
                            
                          
                          
                        
                        
                        OUTPUT
   age  year  nodes  status
0   30    64      1       1
1   30    62      3       1
2   30    65      0       1
3   31    59      2       1
4   31    65      4       1
                        
                          
                          
                            
                              
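# Shape (rows, columns)
print(data.shape)
# Column names
print(data.columns)
# Class labels and their counts
print(data['status'].value_counts())
# Column types and non-null counts; info() prints its report and returns None,
# so wrapping it in print() also prints "None"
print(data.info())
# The classes are imbalanced:
# 1 - survived 5 years or longer
# 2 - died within 5 years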
                            
                          
                          
                        
                        OUTPUT
                        SHAPE
                        (306, 4) 
                        COLUMNS
                        Index(['age', 'year', 'nodes', 'status'], dtype='object')
                        CLASS LABELS
                        1    225
2     81
Name: status, dtype: int64
                        INFO
                        <class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age       306 non-null int64
year      306 non-null int64
nodes     306 non-null int64
status    306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB
None
IMBALANCED DATASET - The classes are not equally represented. Here, class 1 has 225 records and class 2 has 81 records.
Observation
- Number of rows - 306
- Number of columns - 4
- There are no missing values in any column
- Column names - age, year, nodes, status
- All columns are of integer type
- Number of patients with status 1 - 225 (survived 5 years or longer)
- Number of patients with status 2 - 81 (died within 5 years)
- The dataset is imbalanced
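
To put a number on the imbalance, the class shares can be computed from the counts printed above. This is a minimal sketch in plain Python; the dictionary `class_counts` and the other variable names are illustrative and simply restate the `value_counts()` output.

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 225, 2: 81}

total = sum(class_counts.values())

# Share of each class among all patients
class_shares = {label: count / total for label, count in class_counts.items()}

# How many class-1 records there are for every class-2 record
imbalance_ratio = class_counts[1] / class_counts[2]

for label, share in class_shares.items():
    print(f"status {label}: {share:.1%} of {total} patients")
print(f"imbalance ratio: {imbalance_ratio:.2f} to 1")
```

Roughly three out of four patients belong to class 1, so a model that always predicts class 1 would already be right about 74% of the time.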
                           
                        
                        
                        
                          
                          
                            
# Dataset description
print(data.describe())
                            
                          
                          
                        
                        OUTPUT
              age        year       nodes      status
count  306.000000  306.000000  306.000000  306.000000
mean    52.457516   62.852941    4.026144    1.264706
std     10.803452    3.249405    7.189654    0.441899
min     30.000000   58.000000    0.000000    1.000000
25%     44.000000   60.000000    0.000000    1.000000
50%     52.000000   63.000000    1.000000    1.000000
75%     60.750000   65.750000    4.000000    2.000000
max     83.000000   69.000000   52.000000    2.000000
                        
                        Observation 
                        
The describe function reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum of every column.
                        
                        Mean
1. Average age of patients - 52 years
2. Average year of operation - 1963
3. Average number of nodes found per patient - 4
                        
                        Range of Columns
                        1. AGE - (30-83)
                        2. YEAR - (1958-1969)
                        3. NODES - (0-52)
                        
                        Intuitive Inferences
The average number of nodes is 4, while the maximum is 52.
Also, the 75th percentile of nodes is only 4, i.e. 75% of the patients have 4 or fewer nodes.
Therefore, the nodes column most likely contains outliers.

Similarly, the 75th percentile of age is about 61 years,
while the maximum age is 83 years.
Hence, there may be outliers in the age column as well, although the gap is much smaller than for nodes.

*NOTE*: These are assumptions based on intuition. They have not been verified yet.
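
One common way to check such intuitions is Tukey's 1.5 x IQR rule, which flags values lying more than 1.5 times the inter-quartile range beyond the quartiles. The rule is not part of the original analysis; the sketch below applies it to the quartiles printed by `describe()`, and the helper name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 25%, 75%, min and max values copied from the describe() output above
summary = {
    "age": {"q1": 44.0, "q3": 60.75, "min": 30.0, "max": 83.0},
    "nodes": {"q1": 0.0, "q3": 4.0, "min": 0.0, "max": 52.0},
}

has_outliers = {}
for column, s in summary.items():
    lower, upper = iqr_fences(s["q1"], s["q3"])
    # A column has outliers if its extremes fall outside the fences
    has_outliers[column] = s["min"] < lower or s["max"] > upper
    print(f"{column}: fences = ({lower}, {upper}), outliers: {has_outliers[column]}")
```

Under this rule the nodes column has outliers (its maximum of 52 is far above the upper fence of 10), while the age column does not, since 83 lies below the upper fence of 85.875.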
                        
                        
                        2. BIVARIATE ANALYSIS 
The analysis of two variables together is known as bivariate analysis. We can use 2D scatter plots to plot one variable against another.
2-D Scatter Plot - provides a visual image of the relationship between two variables.
                        
                          
                          
                            
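# Using seaborn
# Scatter plot of age against year of operation
# hue="status" colours the points by survival status
# height sets the facet size (called size in seaborn versions before 0.9)
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "year").add_legend()
plt.show()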
                            
                          
                          
                        
                        
                        
                          
                          
                            
# Scatter plot of age against nodes
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "nodes").add_legend()
plt.show()
                            
                          
                          
                        
                        
                        
                          
                          
                            
# Scatter plot of year of operation against nodes
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "year", "nodes").add_legend()
plt.show()
                            
                          
                          
                        
                        
                        Observation
1. Most patients have fewer than 5 nodes.
2. Patients younger than 40 have a higher chance of survival and comparatively fewer nodes.
3. Patients older than 50 with more than 10 nodes have a lower chance of survival.
4. Most patients are between 40 and 65 years old.
                        
                        Limitations of 2D scatter plots
Scatter plots are hard to interpret when many points overlap.
As the number of features increases, the number of pairs also increases, so plotting a 2D graph for each pair takes time.
To solve this problem we have pair plots, which are built into the seaborn library.
A pair plot draws all the variable pairs with one line of code.
                        
                        
                        
                        3. Pair-plots
Instead of plotting each pair separately, we can use pair plots.
Total number of distinct pairs for n features - nC2 = n(n-1)/2
Limitation - not useful when the number of dimensions is high
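
The nC2 count, and why it becomes a limitation, can be illustrated with the standard library. This is a small sketch; the list `features` just restates the three column names used for the pair plot.

```python
from itertools import combinations
from math import comb

features = ["age", "year", "nodes"]

# Every unordered pair of distinct features
pairs = list(combinations(features, 2))
print(pairs)

# The number of pairs grows quadratically with the number of features
for n in (3, 10, 100):
    print(f"{n} features -> {comb(n, 2)} pairs")
```

With 3 features there are only 3 pairs to inspect, but 100 features would already give 4950 pairs.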
                        
                          
                          
                            
plt.close()
sns.set_style("whitegrid")
# height sets the size of each facet (called size in seaborn versions before 0.9)
sns.pairplot(data, hue="status", height=4, vars=['age', 'year', 'nodes'])
plt.show()
                            
                          
                          
                        
                        
                        OBSERVATION
1. The number of patients drops sharply as the number of nodes goes from 0 to 4. Most patients have close to zero nodes.
2. Patients with fewer nodes have a higher chance of survival, and vice versa.
3. Among patients with more than 5 nodes, survivors are only about half as many as deaths.
4. The survival rate is higher before 1965 and comparatively lower after that.
                        
                        
                        4. Univariate Analysis
Univariate analysis studies each variable separately. It can be done using the following graphs.
Histogram
A histogram represents the distribution of numerical data and gives an estimate of the probability distribution of a continuous variable. We divide the range of values into intervals, known as bins, and count how many values fall into each bin. The problem with a 1D scatter plot is that points overlap, which makes it difficult to choose a threshold. The farther apart the distributions of the two classes are, the easier they are to separate.
Probability Density Function (PDF)
The PDF describes how likely it is for a value to fall in a particular range. It is used to describe a continuous probability distribution, i.e. random variables with continuous outcomes. The area under the curve always sums to 1.
Cumulative Distribution Function (CDF)
The CDF describes what percentage of the values is less than or equal to a particular value. Integrating the Probability Density Function gives the CDF. Like the histogram and the PDF, it is a univariate analysis.
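
The relation between a density histogram, the per-bin probabilities and the CDF can be sketched with NumPy. The sample values below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not taken from the Haberman data.

```python
import numpy as np

# Hypothetical sample values, used only for illustration
values = np.array([0, 0, 1, 1, 2, 3, 4, 8, 13, 20])

# density=True returns the height of each bar so that the total area is 1
density, bin_edges = np.histogram(values, bins=4, density=True)
bin_widths = np.diff(bin_edges)

# Area of each bar = probability of falling into that bin
bin_probs = density * bin_widths
area = bin_probs.sum()

# Accumulating the bin probabilities gives the CDF at the upper bin edges
cdf = np.cumsum(bin_probs)

print("bin edges:", bin_edges)
print("bin probabilities:", bin_probs)
print("total area:", area)
print("CDF:", cdf)
```

Here 7 of the 10 values fall into the first bin, so the CDF starts at 0.7 and reaches 1.0 at the last bin edge.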
                        
                        
                          
                            
# 1D scatter plot using one feature (age)
# loc with a boolean mask returns the rows where the condition is True
live_more = data.loc[data["status"] == 1]  # survived 5 years or longer
live_less = data.loc[data["status"] == 2]  # died within 5 years
plt.plot(live_more["age"], np.zeros_like(live_more['age']), 'o')
plt.plot(live_less["age"], np.zeros_like(live_less['age']), 'o')
plt.show()
                            
                          
                          
                        
                        
                        Observation
1. Too many points overlap.
2. Patients younger than 35 tend to survive longer.
                        
Probability Density Function
The PDF shows how the values are distributed, i.e. what proportion of values lies within a given range.
FacetGrid helps in visualizing one or more variables.
Seaborn's distplot draws a histogram with a density curve on top of it.
                        
                          
                            
# Note: distplot is deprecated since seaborn 0.11 (histplot and displot are its replacements)
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "age").add_legend()
plt.show()
                            
                          
                          
                        
                        
                        Observation
                        
1. The overlap indicates that the chance of survival cannot be determined clearly from age alone.
2. Patients below 35 have a high chance of surviving.
3. For patients between 35 and 40, the density of survivors is almost double that of non-survivors.
4. Patients between 40 and 50 have a lower chance of surviving.
5. Patients between 50 and 65 have an almost equal chance of surviving or not.
6. Patients above 65 have a low survival rate.
7. The number of patients increases with age up to about 50 and then decreases.
8. Most patients are between 40 and 70 years old.

Overall, age is not a strong determining factor and no clear inference can be made from it.
                        
                          
                            
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "year").add_legend()
plt.show()
                            
                          
                          
                        
                        
                        Observation
This shows the survival rate by year of operation, which cannot be a deciding factor for the chance of survival.
Still, it can be seen that operations were less successful until 1960, after which the number of successful operations increased until 1963.
There was again a high rate of unsuccessful operations between 1963 and 1967.
                        
                          
                            
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "nodes").add_legend()
plt.show()
                            
                          
                          
                        
                        
                        Observations
1. The chance of survival decreases beyond 10 nodes.
2. The survival rate is almost negligible beyond 25 nodes.
                        
Cumulative Distribution Function
It tells what percentage of the population is less than or equal to a particular value.
                        
                          
                            
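# Patients who survived 5 years or longer
counts, bin_edges = np.histogram(live_more['nodes'], bins=10, density=True)
# Normalize so that the bin values sum to 1
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
# The running sum of the PDF gives the CDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.legend()
plt.show()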
                            
                          
                          
                        
                        [0.83555556 0.08       0.02222222 0.02666667 0.01777778 0.00444444
 0.00888889 0.         0.         0.00444444]
[ 0.   4.6  9.2 13.8 18.4 23.  27.6 32.2 36.8 41.4 46. ]
                        
                        
                        
                          
                            
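# Patients who died within 5 years
counts, bin_edges = np.histogram(live_less['nodes'], bins=10, density=True)
# Normalize so that the bin values sum to 1
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
# The running sum of the PDF gives the CDF
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf, label='PDF')
plt.plot(bin_edges[1:], cdf, label='CDF')
plt.legend()
plt.show()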
                            
                          
                          
                        
                        [0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
 0.01234568 0.         0.         0.01234568]
[ 0.   5.2 10.4 15.6 20.8 26.  31.2 36.4 41.6 46.8 52. ]
                        
                        
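Observation
About 84% of the patients who survived 5 years or longer have fewer than 4.6 nodes (i.e. 4 or fewer), compared with about 57% of the patients who died within 5 years having fewer than 5.2 nodes.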
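
The CDF values themselves are plotted but not printed by the code above. As a sketch, they can be recovered by accumulating the printed bin probabilities; the lists below are copied from the two outputs, and the variable names are illustrative.

```python
from itertools import accumulate

# Bin probabilities and upper bin edges copied from the two outputs above
pdf_survived = [0.83555556, 0.08, 0.02222222, 0.02666667, 0.01777778,
                0.00444444, 0.00888889, 0.0, 0.0, 0.00444444]
edges_survived = [4.6, 9.2, 13.8, 18.4, 23.0, 27.6, 32.2, 36.8, 41.4, 46.0]

pdf_died = [0.56790123, 0.14814815, 0.13580247, 0.04938272, 0.07407407,
            0.0, 0.01234568, 0.0, 0.0, 0.01234568]
edges_died = [5.2, 10.4, 15.6, 20.8, 26.0, 31.2, 36.4, 41.6, 46.8, 52.0]

# Running sums of the bin probabilities give the CDF at each upper bin edge
cdf_survived = list(accumulate(pdf_survived))
cdf_died = list(accumulate(pdf_died))

print("Survived 5 years or longer:")
for edge, value in zip(edges_survived, cdf_survived):
    print(f"  nodes <= {edge:>4}: {value:.1%}")

print("Died within 5 years:")
for edge, value in zip(edges_died, cdf_died):
    print(f"  nodes <= {edge:>4}: {value:.1%}")
```

By the second bin edge the CDF reaches about 92% for the survivors (up to 9.2 nodes) but only about 72% for the other class (up to 10.4 nodes).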
                        
                        
                        Mean, Variance and Std-dev
1. Mean is the average of the values in a column.
2. Variance measures the spread of the data around the mean.
3. Standard deviation is the square root of the variance. It measures how far the data typically lies from the mean.
These statistics help in spotting outliers, but a single outlier can distort the mean. Hence, we also use the median. A few outliers cannot corrupt the median, but if more than 50% of the data is corrupt then the median is affected as well.
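
The effect of a single outlier on the mean and the median can be seen on a tiny example. The values below are hypothetical node counts, chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical node counts, chosen only for illustration
clean = [0, 0, 1, 2, 3]
# The same sample with the largest value replaced by an extreme one
with_outlier = [0, 0, 1, 2, 52]

print("mean:  ", mean(clean), "->", mean(with_outlier))
print("median:", median(clean), "->", median(with_outlier))
```

The mean jumps from 1.2 to 11, while the median stays at 1.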
                        
                          
                            
print('Means : ')
print(np.mean(live_more['nodes']))
print(np.mean(live_less['nodes']))

print('\nVariance : ')
print(np.var(live_more['nodes']))
print(np.var(live_less['nodes']))

print('\nStandard Deviation : ')
print(np.std(live_more['nodes']))
print(np.std(live_less['nodes']))
                            
                          
                          
                        
                        Means : 
2.7911111111111113
7.45679012345679
Variance : 
34.30747654320981
83.3345526596555
Standard Deviation : 
5.857258449412131
9.128776076761632
                        
                        Observation
1. Patients who survived 5 years or longer had about 2.8 nodes on average.
2. Patients who died within 5 years had a much higher average of about 7.5 nodes.
                        
                        Median, Percentile, Quantile, IQR, MAD
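1. Median is the middle value of the sorted data. Unlike the mean, it is robust to outliers.
2. Quantiles split the sorted data at given percentages; the quartiles are the 25th, 50th and 75th percentiles.
3. IQR is the inter-quartile range, i.e. the difference between the 75th and 25th percentiles.
4. MAD is the median absolute deviation, i.e. the median of the absolute deviations of the values from the median (center).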
                        
                          
                            
print("Medians:")
print(np.median(live_more['nodes']))
print(np.median(live_less['nodes']))
print("")

# np.arange(0, 100, 25) requests the 0th, 25th, 50th and 75th percentiles
print("Quantiles:")
print(np.percentile(live_more['nodes'], np.arange(0, 100, 25)))
print(np.percentile(live_less['nodes'], np.arange(0, 100, 25)))
print("")

print("50th percentile")
print(np.percentile(live_more['nodes'], 50))
print(np.percentile(live_less['nodes'], 50))
print("")
                            
                          
                          
                        
                        Medians:
0.0
4.0
Quantiles:
[0. 0. 0. 3.]
[ 0.  1.  4. 11.]
50th percentile
0.0
4.0
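
Each Quantiles row above has four values because `np.arange(0, 100, 25)` stops before 100, so the last value is the 75th percentile and not the maximum. The sketch below shows this on a hypothetical sample that is not taken from the dataset.

```python
import numpy as np

# The percentiles requested by the code above: np.arange excludes the stop value
requested = np.arange(0, 100, 25)
print(requested)

# Hypothetical sample, used only to show which value belongs to which percentile
sample = np.array([0, 0, 0, 1, 2, 3, 5, 8, 13])
quartile_values = np.percentile(sample, requested)
for p, value in zip(requested, quartile_values):
    print(f"{p}th percentile: {value}")

# Extending the range to 101 adds the 100th percentile, i.e. the maximum
with_max = np.percentile(sample, np.arange(0, 101, 25))
print(with_max)
```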
Observation
1. At least half of the patients who survived 5 years or longer had 0 nodes, and 75% of them had 3 or fewer.
2. Patients who died within 5 years had a median of 4 nodes and a 75th percentile of 11 nodes.
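
The IQR and MAD defined in this section are not computed by the code above. A minimal sketch with NumPy follows; the array `nodes` holds hypothetical values, and on the real data it could be replaced by `live_more['nodes']` or `live_less['nodes']`.

```python
import numpy as np

# Hypothetical node counts, used only for illustration
nodes = np.array([0, 0, 1, 2, 3, 4, 7, 11, 52])

# IQR: difference between the 75th and the 25th percentile
q1, q3 = np.percentile(nodes, [25, 75])
iqr = q3 - q1

# MAD: median of the absolute deviations from the median
center = np.median(nodes)
mad = np.median(np.abs(nodes - center))

print("IQR:", iqr)
print("Median:", center)
print("MAD:", mad)
```

Both measures ignore how extreme the largest value is: replacing 52 by any other value above 11 would leave the IQR and the MAD of this sample unchanged.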