INTRODUCTION TO BOX PLOTS
A box plot is a method for graphically depicting numerical data through their quartiles (explained below). The boxes may have lines extending vertically from them, known as whiskers, which indicate variability outside the upper and lower quartiles.
The spacing between the different parts of the box shows the spread (degree of dispersion) and the skewness of the data. Outliers may be plotted as individual points.
What Does a Box Plot Show?
The box plot shows us the five-number summary of the data, which includes the following:
1. Minimum (Lower Whisker)
2. First Quartile (Q1)
3. Median, also known as the Second Quartile (Q2)
4. Third Quartile (Q3)
5. Maximum (Upper Whisker)
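Further, it also shows the Inter Quartile Range (IQR), which is equal to Q3 - Q1. Values lower than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR are denoted by dots in the figure and are known as outliers. When outliers are present, the whiskers end at the most extreme data points that still lie within these limits, not at the true minimum and maximum.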
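As a rough illustration of the rule above, the sketch below computes the five-number summary, the IQR and the outlier limits by hand, using only Python's standard library. The list is the value1 series that is plotted later in this article. The names lower_fence and upper_fence are our own, and the "inclusive" quantile method is an assumption, chosen because for this data it gives the same quartiles as np.percentile.

```python
import statistics

# The value1 series from the plotting example in this article
data = [82, 76, 84, 40, 67, 12, 75, 78, 11, 35,
        98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

# Quartiles, interpolating linearly between neighbouring data points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond these limits are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)

print("min, Q1, median, Q3, max:", min(data), q1, q2, q3, max(data))
print("IQR:", iqr)
print("limits:", lower_fence, upper_fence)
print("outliers:", outliers)
```

For this data Q1 is 55, the median is 73.5 and Q3 is 82, so the IQR is 27 and the limits are 14.5 and 122.5. The two values 11 and 12 fall below the lower limit and would be drawn as dots.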
Why Box Plots?
A box plot may seem more primitive than a histogram or a kernel density estimate (KDE), but its main advantage is that it takes up less space, which makes it particularly useful for comparing distributions side by side.
Now let us plot a box plot
1. First we need to import pyplot from matplotlib
import matplotlib.pyplot as plt
2. Now let us create some dummy data as lists and plot the box plots
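import numpy as np
import pandas as pd
value1 = [82,76,84,40,67,12,75,78,11,35,98,89,78,67,72,82,87,66,56,52]
value2 = [1,12,15,14,23,45,67,87,97,45,64,34,54,56,89,9,100,43,27,4]
# Summary statistics (count, mean, std, min, quartiles, max) for value1
print(pd.DataFrame(value1).describe())
# The three quartiles Q1, Q2 (median) and Q3 of value1
n1 = np.array(value1)
p1 = np.percentile(n1, [25, 50, 75])
print(p1)
box_plot_data = [value1, value2]
plt.boxplot(box_plot_data)
# Label the boxes via xticks (they sit at x = 1 and 2); this works across Matplotlib versions
plt.xticks([1, 2], ['series1', 'series2'])
plt.show()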
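To check what the plot above actually draws, we can ask Matplotlib for the numbers behind a box. The sketch below uses matplotlib.cbook.boxplot_stats on value1 and assumes the default whisker setting of 1.5 times the IQR, which plt.boxplot also uses by default. The variable name stats is our own.

```python
from matplotlib import cbook

value1 = [82, 76, 84, 40, 67, 12, 75, 78, 11, 35,
          98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

# boxplot_stats returns one dictionary of statistics per data series
stats = cbook.boxplot_stats(value1)[0]

print("Q1, median, Q3:", stats["q1"], stats["med"], stats["q3"])
print("whisker ends:", stats["whislo"], stats["whishi"])
print("outliers:", sorted(stats["fliers"].tolist()))
```

For value1 the lower whisker ends at 35 instead of the minimum of 11, because 11 and 12 lie below Q1 - 1.5*IQR and are drawn as separate points.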
3. The same box plot can be made with the seaborn library. Let us plot it with seaborn
import seaborn as sns
sns.boxplot(data=box_plot_data)
plt.show()
4. Now let us plot a box plot for a table. For that we need to read the table first
# Load the mtcars data set into a DataFrame and look at the first rows
mtcars = pd.read_csv('mtcars')
mtcars.head()
(Now let us plot the box plot for the column 'mpg')
# Passing the column as y draws a single vertical box
sns.boxplot(y=mtcars.mpg, width=0.35)
In the plot above we can see that there is only one outlier in the column and that the spread of the data is not large.
(Note: You can also plot box plots directly for all the numerical columns of the data by using mtcars.boxplot())
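As a small sketch of the note above, the example below calls boxplot() on a DataFrame. Since the mtcars file is not reproduced here, it uses a hypothetical stand-in table, df, built from the two dummy series used earlier; the column names and the axis label are our own choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for a table: the two dummy series as columns
df = pd.DataFrame({
    "series1": [82, 76, 84, 40, 67, 12, 75, 78, 11, 35,
                98, 89, 78, 67, 72, 82, 87, 66, 56, 52],
    "series2": [1, 12, 15, 14, 23, 45, 67, 87, 97, 45,
                64, 34, 54, 56, 89, 9, 100, 43, 27, 4],
})

# Draws one box per numeric column on a single set of axes
ax = df.boxplot()
ax.set_ylabel("value")
plt.show()
```

Because every column shares one y-axis, this works best when the columns are on a similar scale.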
5. Now let us plot the box plot for 'mpg' with respect to the categorical column 'gear'
# One box of mpg values for each distinct value of gear
sns.boxplot(x=mtcars.gear, y=mtcars.mpg, width=0.35)