This is the start of a series of posts in which we will look at statistical analysis, predominantly using Python. Statistical analysis involves collecting and scrutinizing every data sample in a set of items from which samples can be drawn.
Statistical analysis can be broken down into the following steps:
· Explore the relation of the data to the underlying population (EDA).
· Create a model to summarize understanding of how the data relates to the underlying population.
· Prove (or disprove) the validity of the model.
· Employ predictive analytics to run scenarios that will help guide future actions.
The goal of statistical analysis is to identify trends. A retail business, for example, might use statistical analysis to find patterns in unstructured and semi-structured customer data that can be used to create a more positive customer experience and increase sales.
In this post we will go through Exploratory Data Analysis, which is the first step towards most data analysis work.
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. In short, the process of organizing, plotting, and summarizing a data set is known as EDA. It often involves converting tabular data into graphical form; done well, a graphical representation allows for more rapid interpretation of the data.
In this post we will look at a couple of ways to visualize data with the intention of gaining useful insight from it, using Python with its workhorse plotting library matplotlib, and also seaborn. The latter is built on top of matplotlib and offers a simple API for advanced visualizations and better default styling of plots.
1. Plotting a Histogram:
· A histogram is essentially a plot of the frequency distribution of data grouped into bins. Suppose we have carefully measured the anatomical properties of samples of three different species of iris: Iris setosa, Iris versicolor, and Iris virginica. This is the popular iris dataset commonly used in data science. Here, we will work with the measurements of petal length.
We have three NumPy arrays, one per species, each containing that species' petal lengths.
The following code plots a histogram of the versicolor petal lengths:
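The post does not show where these arrays come from. One possible way to build them, sketched here under the assumption that scikit-learn and pandas are installed and that the data is scikit-learn's bundled copy of the iris dataset, is the following. The helper petal_lengths() is a hypothetical name introduced only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the data comes from scikit-learn's bundled iris dataset
iris = load_iris()

# Build a data frame with one column per measurement, e.g. 'petal length (cm)'
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to the species names
df['species'] = iris.target_names[iris.target]


def petal_lengths(species):
    """Return the petal lengths (cm) of one species as a NumPy array.

    Hypothetical helper, not part of the original post.
    """
    return df.loc[df['species'] == species, 'petal length (cm)'].to_numpy()


setosa_petal_length = petal_lengths('setosa')
versicolor_petal_length = petal_lengths('versicolor')
virginica_petal_length = petal_lengths('virginica')
```

This also yields a data frame df whose column names match the ones used later in the post.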
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
# Set default Seaborn style
sns.set()
# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)
# Label axes
plt.xlabel('petal length (cm)')
plt.ylabel('count')
# Show histogram
plt.show()
After executing these lines of code, we get the following histogram.
What we can see from the histogram is that the petal lengths range from about 3.0 to 5.0 cm, and that the majority of the 50 samples are longer than 3.5 cm.
We can also plot the histogram with a different number of bins, which can give a better idea of the data. The default number of bins is 10, but we can set it explicitly with the bins argument of plt.hist(), as shown below.
# Plot the histogram with 7 bins
plt.hist(versicolor_petal_length, bins=7)
# Label axes
plt.xlabel('petal length (cm)')
plt.ylabel('count')
# Show histogram
plt.show()
After executing these lines of code, we get the following histogram.
The biggest disadvantage of a histogram is that the same data may be interpreted differently depending on the choice of bins, a problem known as binning bias.
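To make the effect of the bin choice concrete, here is a minimal sketch that bins the same sample two ways with np.histogram, which accepts the same bins argument as plt.hist(). The sample is synthetic (seeded random numbers, not the iris measurements), and the square-root rule used to pick the second bin count is a common rule of thumb, not necessarily how the 7 bins above were chosen.

```python
import numpy as np

# Synthetic sample for illustration only (seeded, NOT the iris data)
rng = np.random.default_rng(42)
sample = rng.normal(loc=4.3, scale=0.5, size=50)

# Square-root rule of thumb: number of bins ~ sqrt(number of observations)
n_bins = int(np.sqrt(len(sample)))

# Bin the same data in two different ways
counts_default, edges_default = np.histogram(sample, bins=10)
counts_sqrt, edges_sqrt = np.histogram(sample, bins=n_bins)

print("bins=10:", counts_default)
print(f"bins={n_bins}:", counts_sqrt)
```

Both count arrays describe the same 50 values, yet their shapes can suggest different stories, which is exactly the binning bias described above.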
2. Plotting a Bee Swarm:
- Let's make a bee swarm plot of the iris petal lengths. The x-axis should contain each of the three species, and the y-axis the petal lengths. The data is in a data frame df with the columns [sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), species].
# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species',y='petal length (cm)',data=df)
# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')
# Show the plot
plt.show()
After executing these lines of code, we get the following bee swarm plot.
We can clearly see from the plot that virginica petals tend to be the longest, and setosa petals tend to be the shortest of the three species.
Now suppose we want to find what percentage of versicolor flowers have a petal length of less than 4 cm.
3. Plotting an ECDF (Empirical Cumulative Distribution Function):
The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample.
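Written out for a sample of n observations x_1, ..., x_n, the ECDF at a point x is the fraction of observations that are less than or equal to x. The notation below is the standard textbook form and is not taken from the original post:

```latex
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}
```

Here the indicator equals 1 when x_i is at most x and 0 otherwise. For distinct values, the ECDF evaluated at the i-th smallest observation is therefore i/n.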
- Let's define an ecdf() function that we can reuse whenever we want to plot an ECDF.
import numpy as np
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    return x, y
We will now use our ecdf() function to compute the ECDF for the petal lengths of versicolor flowers.
# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)
# Generate plot using the x_vers, y_vers returned by ecdf()
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
# Make the margins nice
plt.margins(0.02)
# Label the axes
plt.xlabel('versicolor petal length')
plt.ylabel('ECDF')
# Display the plot
plt.show()
After executing these lines of code, we get the following ECDF plot.
Here we can say that around 30% of versicolor petals are less than 4 cm long.
We can plot the ECDFs of the other species on the same plot for better comparison and understanding.
# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)
# Plot all ECDFs on the same plot
plt.plot(x_set, y_set, marker='.', linestyle='none')
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
plt.plot(x_virg, y_virg, marker='.', linestyle='none')
# Make nice margins
plt.margins(0.02)
# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
plt.xlabel('petal length (cm)')
plt.ylabel('ECDF')
# Display the plot
plt.show()
After executing these lines of code, we get the following ECDF plot.
Reading off the plot, roughly 40% of setosa, versicolor, and virginica petals are shorter than 1.5 cm, 4.5 cm, and 5.5 cm respectively.
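Values read off an ECDF plot by eye are approximate, so it can help to check them numerically. The sketch below shows one way to do that with NumPy. The lengths array is a small made-up sample for illustration, not the iris data, so its numbers say nothing about the real flowers.

```python
import numpy as np

# Hypothetical measurements, made up for illustration (NOT the iris data)
lengths = np.array([3.0, 3.5, 3.9, 4.0, 4.2, 4.5, 4.5, 4.7, 4.9, 5.1])

# Fraction of observations at or below 4 cm: the ECDF evaluated at 4.0
frac_le_4 = np.mean(lengths <= 4.0)

# Inverse reading: the length below which about 40% of observations fall
p40 = np.percentile(lengths, 40)

print(f"ECDF at 4.0 cm: {frac_le_4:.0%}")
print(f"40th percentile: {p40:.2f} cm")
```

Swapping in one of the real petal length arrays would give the exact counterparts of the percentages estimated from the plots.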