Perform Exploratory Data Analysis on Haberman Breast Cancer Dataset

Question

asked Apr 4 in AI-ML-Data Science Projects by Aparajita (750 points)

Data scientists explore and analyze data to extract useful information from it. Exploratory Data Analysis (EDA) is the process of exploring data and drawing meaningful conclusions from it, for example by finding patterns, testing hypotheses, reducing dimensionality, and handling missing values. The Haberman dataset comes from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

2 Answers

answered Apr 4 by Aparajita (750 points)
selected Apr 6 by Aparajita

Best answer

Exploratory Data Analysis on Haberman Dataset

Exploratory Data Analysis is the process of exploring data and drawing meaningful conclusions from it, for example by finding patterns, testing hypotheses, reducing dimensionality, and handling missing values.

HABERMAN DATASET

This dataset records the survival of breast cancer patients who underwent surgery. The study was conducted between 1958 and 1970. It has 4 attributes, as follows.

Columns

1. age - age of the patient at the time of the operation
2. year - year of the operation (year minus 1900, e.g. 64 means 1964)
3. nodes - number of positive axillary nodes detected
4. status - survival status (1 or 2)
1 - survived 5 years or longer
2 - died within 5 years

1. DATA EXPLORATION

Data exploration is the first step towards data analysis. We visualize the data and try to understand it.


# Importing libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Reading the data
data = pd.read_csv('haberman.csv')

# Printing the first 5 rows of the data
print(data.head())

OUTPUT

   age  year  nodes  status
0   30    64      1       1
1   30    62      3       1
2   30    65      0       1
3   31    59      2       1
4   31    65      4       1


# Shape
print(data.shape)

# Columns
print(data.columns)

# Class labels
print(data['status'].value_counts())

# Column types and non-null counts
print(data.info())

# IMBALANCED DATA
# 1 - survived 5 years or longer
# 2 - died within 5 years

OUTPUT

SHAPE

(306, 4)

COLUMNS

Index(['age', 'year', 'nodes', 'status'], dtype='object')

CLASS LABELS

1    225
2     81
Name: status, dtype: int64

INFO

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age       306 non-null int64
year      306 non-null int64
nodes     306 non-null int64
status    306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB
None

IMBALANCED DATASET - The classes are not equally represented. Here, class 1 has 225 records and class 2 has 81 records.
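As a quick check of how strong the imbalance is, the class shares can be computed from the two counts printed by value_counts(). This is a minimal sketch that uses only those two numbers; the variable names are illustrative and not part of the original answer.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 225, 2: 81}

total = sum(class_counts.values())

# Share of each class in the whole dataset.
proportions = {label: count / total for label, count in class_counts.items()}

# How many majority-class rows there are per minority-class row.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in proportions.items():
    print(f"status {label}: {share:.1%} of {total} patients")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Roughly three out of four patients belong to class 1, so a model that always predicts class 1 would already look fairly accurate. That is why accuracy alone is a weak measure on this dataset.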

Observation

Number of rows - 306
Number of columns - 4
There are no missing values in any column.
Column names - age, year, nodes, status
All columns are of integer type.
Number of patients with status '1' - 225 (survived 5 years or longer)
Number of patients with status '2' - 81 (died within 5 years)
The dataset is imbalanced.


# Dataset description
print(data.describe())

OUTPUT

              age        year       nodes      status
count  306.000000  306.000000  306.000000  306.000000
mean    52.457516   62.852941    4.026144    1.264706
std     10.803452    3.249405    7.189654    0.441899
min     30.000000   58.000000    0.000000    1.000000
25%     44.000000   60.000000    0.000000    1.000000
50%     52.000000   63.000000    1.000000    1.000000
75%     60.750000   65.750000    4.000000    2.000000
max     83.000000   69.000000   52.000000    2.000000

Observation

The describe() function reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum of every numeric column.

Mean

1. Average age of patients - 52 years

2. Average year of operation - about 1963 (mean 62.85)

3. Average no. of nodes found in patient - 4

Range of Columns

1. AGE - (30-83)

2. YEAR - (1958-1969)

3. NODES - (0-52)

Intuitive Inferences

The mean number of nodes is 4, while the maximum is 52.

The 75th percentile of nodes is also only 4, so three quarters of the patients have 4 nodes or fewer.

Therefore, the nodes column likely contains outliers.

Similarly, the 75th percentile of age is about 61 years,

and the maximum age is 83 years.

Hence, there may be outliers in age as well.

*NOTE*: These are assumptions based on intuition. They have not been verified yet.
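One common way to test such an intuition is Tukey's 1.5 x IQR rule, which flags values that lie more than 1.5 inter-quartile ranges outside the quartiles. The sketch below applies that rule to the quartiles printed by describe(). The rule is an assumption chosen here for illustration, not the method used in the original answer, and the function name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Minimum, quartiles and maximum copied from the describe() output above.
summary = {
    "age": {"min": 30.0, "q1": 44.0, "q3": 60.75, "max": 83.0},
    "nodes": {"min": 0.0, "q1": 0.0, "q3": 4.0, "max": 52.0},
}

outlier_flags = {}
for column, stats in summary.items():
    lower, upper = tukey_fences(stats["q1"], stats["q3"])
    # A column has outliers if its extremes fall outside the fences.
    outlier_flags[column] = stats["min"] < lower or stats["max"] > upper
    print(f"{column}: fences = ({lower}, {upper}), has outliers = {outlier_flags[column]}")
```

Under this rule the nodes column clearly has outliers (the maximum of 52 is far above the upper fence of 10), while the age column does not (the maximum of 83 is below the upper fence of 85.875).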

2. BIVARIATE ANALYSIS

The analysis of two variables is known as bivariate analysis. We can draw 2D scatter plots to see how two variables relate to each other.

2-D Scatter Plot - gives a visual picture of the relationship between two variables.


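# Using Seaborn
# Between AGE and YEAR
# hue on status (for different colors)
# height replaces the size argument, which newer seaborn versions no longer accept
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "year").add_legend()
plt.show()

Figure: scatter plot of age vs. year, colored by status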


# AGE and NODES
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "age", "nodes").add_legend()
plt.show()

Figure: scatter plot of age vs. nodes, colored by status


# YEAR and NODES
sns.set_style("whitegrid")
sns.FacetGrid(data, hue="status", height=4).map(plt.scatter, "year", "nodes").add_legend()
plt.show()

Figure: scatter plot of year vs. nodes, colored by status

Observation

1. Most patients have fewer than 5 nodes.

2. Patients younger than 40 have a higher chance of survival and comparatively fewer nodes.

3. Patients with more than 10 nodes who are older than 50 have a lower chance of survival.

4. Most patients are between 40 and 65 years old.
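Observations like these are read off the plots by eye, so it helps to back them with a number. The sketch below computes the share of survivors (status 1) inside a group selected by a condition. The rows are hypothetical toy values made up for illustration, not taken from haberman.csv, and survival_rate is an illustrative helper name.

```python
def survival_rate(records, predicate):
    """Share of rows with status 1 among the rows that satisfy predicate."""
    group = [row for row in records if predicate(row)]
    if not group:
        return None  # no patient matches the condition
    survivors = sum(1 for row in group if row["status"] == 1)
    return survivors / len(group)


# Hypothetical rows in the same layout as the dataset (NOT real data).
records = [
    {"age": 34, "nodes": 0, "status": 1},
    {"age": 38, "nodes": 2, "status": 1},
    {"age": 45, "nodes": 1, "status": 1},
    {"age": 52, "nodes": 12, "status": 2},
    {"age": 57, "nodes": 15, "status": 2},
    {"age": 61, "nodes": 11, "status": 1},
    {"age": 66, "nodes": 0, "status": 2},
    {"age": 70, "nodes": 3, "status": 1},
]

under_40 = survival_rate(records, lambda row: row["age"] < 40)
high_risk = survival_rate(records, lambda row: row["nodes"] > 10 and row["age"] > 50)

print("survival rate, age < 40:", under_40)
print("survival rate, nodes > 10 and age > 50:", high_risk)
```

On the real data the same conditions can be applied to the rows loaded from the CSV file; only the resulting rates, not the toy values above, say anything about the patients.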

Limitations of 2D scatter plots

Scatter plots are hard to interpret when the points overlap heavily.

As the number of features increases, the number of pairs increases as well, so plotting a 2D graph for each pair takes time.

To solve this problem we have pair plots, which are a built-in function of the seaborn library.

A pair plot draws all the variable pairs with one line of code.

3. Pair-plots

Instead of plotting each pair separately, we can use pair plots.

Total number of distinct pairs - nC2 = n(n-1)/2 for n features

Limitation - not useful when the number of dimensions is high
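To see why pair plots stop being practical in high dimensions, the number of distinct feature pairs can be counted directly. This small sketch lists the pairs for the three features used here and shows how fast nC2 grows; the larger feature counts are arbitrary illustrative values.

```python
from itertools import combinations
from math import comb

features = ["age", "year", "nodes"]

# Every unordered pair of features gets one scatter plot.
pairs = list(combinations(features, 2))
print(pairs)

# The pair count grows quadratically with the number of features.
for n in (3, 10, 50):
    print(f"{n} features -> {comb(n, 2)} pairs")
```

With 3 features there are only 3 pairs, which is easy to read. With 50 features there would be 1225 panels, which is why pair plots are not used for high-dimensional data.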


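plt.close()
sns.set_style("whitegrid")
# height replaces the size argument, which newer seaborn versions no longer accept
sns.pairplot(data, hue="status", height=4, vars=['age', 'year', 'nodes'])
plt.show()

Figure: pair plot of age, year and nodes, colored by status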

OBSERVATION

1. The number of patients drops sharply as the number of nodes goes from 0 to 4. Most patients have a node count close to zero.

2. Patients with fewer nodes have a better chance of survival, and vice versa.

3. Among patients with more than 5 nodes, survivors are only about half as many as deaths.

4. The survival rate is higher before 1965 and comparatively lower after that.

4. Univariate Analysis

Analyze each variable separately. Univariate analysis can be done using the following graphs.

Histogram

A histogram shows the distribution of numerical data and gives an estimate of the probability distribution of a continuous variable. The range of values is divided into intervals, known as bins, and the height of each bar is the number of values that fall into that bin. A 1D scatter plot suffers from overlapping points, which makes it difficult to choose a threshold; a histogram avoids this problem. The farther apart the distributions of the two classes are, the easier the classes are to separate.

Probability Density Function (PDF)

It describes how likely values are to fall in a particular range: the probability of a range is the area under the curve over that range. It is used to describe a continuous probability distribution, i.e. a random variable with continuous outcomes. The total area under the curve always sums to 1.

Cumulative Distribution Function (CDF)

It gives the percentage of values that are less than or equal to a particular value. Integrating the Probability Density Function gives the CDF. The CDF is also a univariate analysis.
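The relationship between the three ideas can be sketched without any plotting library: count the values per bin (histogram), divide by the total (per-bin fraction, which this answer calls the pdf), and take the running sum (CDF). The sample values and the function name are hypothetical, and the sketch assumes the values are not all identical so that the bin width is positive.

```python
from itertools import accumulate


def histogram_pdf_cdf(values, n_bins):
    """Return bin edges, per-bin fractions and their cumulative sum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value is placed in the last bin (right edge inclusive).
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    pdf = [count / len(values) for count in counts]
    cdf = list(accumulate(pdf))
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, pdf, cdf


# Hypothetical node counts for ten patients (NOT real data).
sample_nodes = [0, 0, 0, 1, 1, 2, 3, 4, 8, 20]

edges, pdf, cdf = histogram_pdf_cdf(sample_nodes, n_bins=4)
print("bin edges:", edges)
print("fraction per bin:", pdf)
print("cumulative fraction:", cdf)
```

The last CDF value is always 1, because every value lies at or below the maximum. Each CDF entry answers the question "what fraction of the values lies at or below this bin's right edge?".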

# 1D scatter plot using one feature (AGE)

# loc with a boolean mask returns the rows for which the condition is True

live_more = data.loc[data["status"] == 1]

live_less = data.loc[data["status"] == 2]

plt.plot(live_more["age"], np.zeros_like(live_more['age']), 'o')

plt.plot(live_less["age"], np.zeros_like(live_less['age']), 'o')

plt.show()

Figure: 1D scatter plot of age for both classes

Observation

1. The data points overlap too much.

2. Patients younger than 35 tend to survive more often.

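Probability Density Function

The PDF is the probability distribution that tells what fraction of the values lies within a given range.

FacetGrid helps in visualizing the distribution of one or more variables, split by a category such as status.

Seaborn's distplot draws a histogram with a kernel density estimate (KDE) curve on top of it. Note that distplot is deprecated since seaborn 0.11; histplot with kde=True is its replacement.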

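# height replaces the size argument, which newer seaborn versions no longer accept
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "age").add_legend()
plt.show()

Figure: histogram and density (PDF) of age, by status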

Observation

1. The overlap indicates that the chance of survival cannot be determined clearly from age.

2. Patients below age 35 have a high chance of survival.

3. Patients between ages 35 and 40 are almost twice as likely to survive as not.

4. Patients between ages 40 and 50 have a lower chance of survival.

5. Patients between ages 50 and 65 have an almost equal chance of surviving or not.

6. Patients above 65 years have a low survival rate.

7. The number of patients increases up to age 50 and then decreases.

8. Most patients are between ages 40 and 70.

But age is not a strong determining factor, and no clear inference can be made from it.

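# height replaces the size argument, which newer seaborn versions no longer accept
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "year").add_legend()
plt.show()

Figure: histogram and density (PDF) of year of operation, by status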

Observation

This shows the survival rate by year of operation, which by itself cannot be a factor for deciding the chance of survival.

However, it can be seen that operations were less successful until 1960, and then the number of successful operations increased until 1963.

Again, there was a high rate of unsuccessful operations between 1963 and 1967.

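# height replaces the size argument, which newer seaborn versions no longer accept
sns.FacetGrid(data, hue="status", height=4).map(sns.distplot, "nodes").add_legend()
plt.show()

Figure: histogram and density (PDF) of nodes, by status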

Observations

1. The chance of survival decreases beyond 10 nodes.

2. The survival rate is almost negligible beyond 25 nodes.

Cumulative Distribution Function

It gives the percentage of the population that is less than or equal to a particular value.

# Per-bin fractions (called pdf here) and their running sum (cdf) for the nodes column

counts, bin_edges = np.histogram(live_more['nodes'], bins=10, density = True)

pdf = counts/(sum(counts))

print(pdf)

print(bin_edges)

cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:],pdf)

plt.plot(bin_edges[1:], cdf)

[0.83555556 0.08       0.02222222 0.02666667 0.01777778 0.00444444
 0.00888889 0.         0.         0.00444444]
[ 0.   4.6  9.2 13.8 18.4 23.  27.6 32.2 36.8 41.4 46. ]

Figure: PDF and CDF of nodes for patients who survived 5 years or longer

counts, bin_edges = np.histogram(live_less['nodes'], bins=10, density = True)

pdf = counts/(sum(counts))

print(pdf)

print(bin_edges)

cdf = np.cumsum(pdf)

plt.plot(bin_edges[1:],pdf)

plt.plot(bin_edges[1:], cdf)

[0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
 0.01234568 0.         0.         0.01234568]
[ 0.   5.2 10.4 15.6 20.8 26.  31.2 36.4 41.6 46.8 52. ]

Figure: PDF and CDF of nodes for patients who died within 5 years

Observation
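About 83.6% of the patients who survived 5 years or longer have fewer than 4.6 nodes (the first bin). Among the patients who died within 5 years, only about 56.8% have fewer than 5.2 nodes.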
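The cumulative shares behind this observation can be recomputed from the two printed arrays. The sketch below copies the per-bin fractions and bin edges from the output above and accumulates them, so the two classes can be compared at the first two bin edges. The variable names are illustrative.

```python
from itertools import accumulate

# Per-bin fractions and bin edges copied from the printed output above.
pdf_survived = [0.83555556, 0.08, 0.02222222, 0.02666667, 0.01777778,
                0.00444444, 0.00888889, 0.0, 0.0, 0.00444444]
edges_survived = [0.0, 4.6, 9.2, 13.8, 18.4, 23.0, 27.6, 32.2, 36.8, 41.4, 46.0]

pdf_died = [0.56790123, 0.14814815, 0.13580247, 0.04938272, 0.07407407,
            0.0, 0.01234568, 0.0, 0.0, 0.01234568]
edges_died = [0.0, 5.2, 10.4, 15.6, 20.8, 26.0, 31.2, 36.4, 41.6, 46.8, 52.0]

# The CDF is the running sum of the per-bin fractions.
cdf_survived = list(accumulate(pdf_survived))
cdf_died = list(accumulate(pdf_died))

groups = (
    ("survived 5 years or longer", edges_survived, cdf_survived),
    ("died within 5 years", edges_died, cdf_died),
)
for name, edges, cdf in groups:
    # Report the cumulative share at the right edge of the first two bins.
    for right_edge, share in zip(edges[1:3], cdf[:2]):
        print(f"{name}: {share:.1%} have fewer than {right_edge} nodes")
```

The bin edges differ between the two groups because each histogram spans its own range (0 to 46 versus 0 to 52), so the two CDFs are only roughly comparable at a given node count.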

Mean, Variance and Std-dev

1. Mean is the average of a particular column.

2. Variance indicates the spread of the data.

3. Standard deviation is the square root of the variance. It measures how far the data typically lies from the mean.

These are used for finding outliers. However, a single extreme outlier can shift the mean considerably, which is why we also use the median. A few outliers cannot corrupt the median, but if 50% or more of the data is corrupted, the median is affected as well.
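The difference in robustness is easy to demonstrate on a tiny example. The sketch below uses a hypothetical list of node counts and adds one extreme value (52, the maximum seen in the dataset) to show how much the mean moves compared with the median.

```python
from statistics import mean, median

# Hypothetical node counts (NOT real data).
nodes_clean = [0, 0, 1, 1, 2, 3, 4]

# The same values plus one extreme observation.
nodes_with_outlier = nodes_clean + [52]

print("mean without outlier:  ", mean(nodes_clean))
print("median without outlier:", median(nodes_clean))
print("mean with outlier:     ", mean(nodes_with_outlier))
print("median with outlier:   ", median(nodes_with_outlier))
```

A single extra value raises the mean from about 1.6 to 7.875, while the median only moves from 1 to 1.5.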

print('Means : ')

print (np.mean(live_more['nodes']))

print (np.mean(live_less['nodes']))

print('\nVariance : ')

print (np.var(live_more['nodes']))

print (np.var(live_less['nodes']))

print('\nStandard Deviation : ')

print (np.std(live_more['nodes']))

print (np.std(live_less['nodes']))

Means : 
2.7911111111111113
7.45679012345679

Variance : 
34.30747654320981
83.3345526596555

Standard Deviation : 
5.857258449412131
9.128776076761632

Observation

1. Patients who survived 5 years or longer had only about 2.8 nodes on average.

2. Patients who died within 5 years had a much higher average of about 7.5 nodes.

Median, Percentile, Quantile, IQR, MAD

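1. Median is the middle value of the sorted data. Unlike the mean, it is not sensitive to outliers.

2. Quantiles are the values below which a given percentage of the data falls; the 0th, 25th, 50th and 75th percentiles are used here.

3. IQR is the inter-quartile range, i.e. the difference between the 75th and the 25th percentile.

4. MAD is the median absolute deviation, i.e. the median of the absolute deviations of the values from the median (center).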
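The code in this section prints medians and percentiles but does not compute the IQR or the MAD, so here is a minimal sketch of both using only the standard library. The sample values are hypothetical, the helper names are illustrative, and method="inclusive" is chosen because it uses the same linear interpolation as NumPy's default percentile.

```python
from statistics import median, quantiles


def iqr(values):
    """Inter-quartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def mad(values):
    """Median absolute deviation: median of |value - median|."""
    center = median(values)
    return median([abs(value - center) for value in values])


# Hypothetical node counts (NOT real data).
sample_nodes = [0, 0, 0, 1, 2, 3, 4, 11, 23]

print("median:", median(sample_nodes))
print("IQR:   ", iqr(sample_nodes))
print("MAD:   ", mad(sample_nodes))
```

Both measures describe spread the way variance and standard deviation do, but the two large values (11 and 23) barely affect them.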

print("Medians:")

print(np.median(live_more['nodes']))

print(np.median(live_less['nodes']))

print("")

print("Quantiles:")

print(np.percentile(live_more['nodes'],np.arange(0,100,25)))

print(np.percentile(live_less['nodes'],np.arange(0,100,25)))

print("")

print("50th percentile")

print(np.percentile(live_more['nodes'],50))

print(np.percentile(live_less['nodes'],50))

print("")

Medians:
0.0
4.0

Quantiles:
[0. 0. 0. 3.]
[ 0.  1.  4. 11.]

50th percentile
0.0
4.0

Observation

1. Among patients who survived 5 years or longer, at least half had 0 nodes (the 25th and 50th percentiles are both 0), and 75% had 3 nodes or fewer.

2. Patients who died within 5 years had a median of 4 nodes.
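When reading the quantile output, note that np.arange(0, 100, 25) stops before 100, so the four printed values are the 0th, 25th, 50th and 75th percentiles and the maximum is not included. The sketch below mirrors this with range() and a small linear-interpolation percentile function. The function is a hypothetical stand-in for np.percentile, and the sample values are made up.

```python
def percentile(values, p):
    """Percentile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical node counts (NOT real data).
sample_nodes = [0, 0, 0, 1, 2, 3, 4, 11, 23]

# Mirrors np.arange(0, 100, 25): the stop value 100 is excluded.
percent_points = list(range(0, 100, 25))
# Using 101 as the stop value also includes the 100th percentile (the maximum).
percent_points_with_max = list(range(0, 101, 25))

print(percent_points, [percentile(sample_nodes, p) for p in percent_points])
print(percent_points_with_max, [percentile(sample_nodes, p) for p in percent_points_with_max])
```

To include the maximum in the NumPy version, the stop value of np.arange would have to be greater than 100, for example np.arange(0, 101, 25).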

Aparajita · Answer 1 · 2021-04-05T10:27:04+0000

5. Box-Plots

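Box plot - depicts the lower to upper quartile values of the data, with a line at the median.

Whiskers - show the range of the data excluding outliers. By default they extend to the most extreme points within 1.5 times the IQR of the box.

Outliers - points beyond the whiskers, drawn individually. They are unusual values that should be inspected before drawing conclusions.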
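The numbers that a box plot draws can be computed by hand, which makes the picture easier to read. This is a minimal sketch assuming the common 1.5 x IQR whisker convention; the helper name and the sample values are hypothetical, and plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles


def box_plot_stats(values, whisker_factor=1.5):
    """Return the box, whisker ends and outliers a box plot would show."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical node counts (NOT real data).
sample_nodes = [0, 0, 0, 1, 2, 3, 4, 11, 23]

stats = box_plot_stats(sample_nodes)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Here the box spans 0 to 4, the upper whisker stops at 4 because no other value lies below the upper fence of 10, and the values 11 and 23 would be drawn as individual outlier points.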

# A box plot can be thought of as a PDF viewed from the side.

sns.boxplot(x='status',y='age', data=data)

plt.show()

Figure: box plot of age by status

Observation:

1. Lower age slightly indicates a higher rate of survival.

sns.boxplot(x='status',y='year', data=data)

plt.show()

Figure: box plot of year of operation by status

Observation

1. Later years of operation show a slightly higher rate of success.

sns.boxplot(x='status',y='nodes', data=data)

plt.show()

Figure: box plot of nodes by status

Observation

1. There are many outliers in the nodes column.

2. Few patients have more than 10 nodes.

3. The fewer the nodes, the higher the chance of survival.

4. Beyond 4 nodes the chance of survival is lower, but many patients still survived with a higher number of nodes.

6. Violin plots

A violin plot is a combination of a box plot and a probability density function (PDF).

# Combines the benefits of the previous two plots and simplifies them
# In a violin plot, denser regions of the data are fatter and sparser ones thinner
# The size argument was dropped because violinplot has no such parameter
sns.violinplot(x="status", y="age", data=data)

plt.show()

Figure: violin plot of age by status

Observation

1. Most patients are in the age group 40-60.

2. Above 82 years, the chance of survival is low.

3. Around 30 years and below, patients are more likely to survive.

4. In the age group 40-50, more deaths are seen relative to survivals.

sns.violinplot(x="status", y="year", data=data)

plt.show()

Figure: violin plot of year of operation by status

Observation

1. More operations were unsuccessful in 1965.

2. Comparatively more of the patients operated on up to 1960 survived.

sns.violinplot(x="status", y="nodes", data=data)

plt.show()

Figure: violin plot of nodes by status

Observation

1. Patients with fewer than 1 node (i.e. 0 nodes) are more likely to survive.

2. Even with a negligible number of nodes (close to 0), some patients died.

3. Patients with more than 5 nodes are more likely to die.

7. MULTIVARIATE ANALYSIS

Analysis of three or more variables

Contour plot

Contour plots (level plots) are a way to show a three-dimensional surface on a two-dimensional plane. Two predictor variables X and Y are plotted on the x- and y-axes, and a response variable Z is drawn as contours. In the joint plot used here, Z is the estimated density of patients.
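The height Z that the contours trace in a kind="kde" joint plot is a two-dimensional kernel density estimate. The sketch below evaluates such an estimate by hand with a Gaussian kernel. The bandwidths are fixed, assumed values (seaborn chooses them automatically), the (age, year) points are hypothetical, and kde_2d is an illustrative name rather than a seaborn function.

```python
from math import exp, pi


def kde_2d(points, x, y, bandwidth_x, bandwidth_y):
    """Gaussian kernel density estimate at (x, y) from a list of (px, py) points."""
    total = 0.0
    for px, py in points:
        zx = (x - px) / bandwidth_x
        zy = (y - py) / bandwidth_y
        # Each data point contributes a small Gaussian bump centred on itself.
        total += exp(-0.5 * (zx * zx + zy * zy))
    return total / (len(points) * 2 * pi * bandwidth_x * bandwidth_y)


# Hypothetical (age, year) pairs (NOT real data).
patients = [(45, 60), (47, 61), (50, 62), (52, 61), (54, 63), (38, 58), (70, 67)]

# Density inside the cluster versus far away from any patient.
dense = kde_2d(patients, 50, 61, bandwidth_x=5.0, bandwidth_y=2.0)
sparse = kde_2d(patients, 75, 59, bandwidth_x=5.0, bandwidth_y=2.0)

print("density near the cluster:", dense)
print("density far from it:     ", sparse)
```

A contour line connects all (x, y) positions with the same density value, so tightly nested contours mark the regions where most patients are concentrated.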

sns.jointplot(x="age", y="year", data = data, kind = "kde")

plt.show()

Observation

1. The graph is densest between 1960 and 1963 for ages 45 to 55, so more operations were performed on patients in that age group during those years.

2. There were comparatively more operations in 1958-1959 for the age group 37-43.

FINAL CONCLUSION

1. Most patients have fewer than 4 nodes, and the number of patients drops sharply as the number of nodes increases. Most patients have 0 or 1 nodes.

2. A higher number of nodes indicates a lower chance of survival. But a few patients survived with a high number of nodes, and some patients died with almost no nodes. Hence, the number of nodes alone cannot be a strict deciding factor.

3. Age and year of operation alone cannot be deciding factors. But more patients below the age of 35 survived.

4. Patients below the age of 40 had fewer nodes and a higher survival rate.

If the nodes are discovered at an early age and the patient is operated on, the chances of survival are higher.
