Saturday, July 1, 2023

How to Analyze Box Plots

 

There are two ways that you can analyze box plots. Consider the above plot showing sales of a particular brand per day across all stores over the months from February (2) to June (6). June has 30 days, so its box summarizes 30 daily values.

1. Analyze a single Box Plot

2. Compare two or more box plots

We'll study them separately.

1. Analyze a single Box Plot

There are seven points that you need to focus on when you are analyzing a single box plot:

    a. Total Size of the Plot

The total size of the plot indicates the range of the values. For example, in June, sales per day vary from 0 to 20000 Rs.

    b. Absolute Position of Median

Roughly 50% of the values fall below the median, and 50% fall above it. In June, the median sales per day is about 7500 Rs., with about 15 days below 7500 Rs. and 15 days above it. Given the range from point a, this median is low, since the midpoint of the range is around 10000 Rs. This means that there are more days with values below 10K than days with values above 10K.

    c. Position of the Median Relative to Box

The median is in the lower half of the box. This indicates that the distribution is right-skewed, which means that most days have low values and a few days have much higher values. This is also supported by point b combined with point a, as noted above.

    d. Size of the box compared to the range. 

The size of the box indicates the interquartile range (IQR), i.e. the distance between the 1st quartile and the 3rd quartile. It covers the middle 50% of the data and is relatively robust to extreme values. For June, the box runs roughly from 5000 to 12000. The midpoint of the box is 8500, whereas the median is at 7500, lower than the midpoint. The IQR is about 12000 - 5000 = 7000, which is relatively small compared with the range of 0 to 20000. This indicates that the values outside the box are widely spread, i.e. there are more extreme values.

    e. Relative Lengths of the Two Whiskers

We can see that the upper whisker is longer than the lower whisker. Whiskers show how far the values outside the box extend. So the data has more extreme values at the upper end than at the lower end.

    f. Relative Lengths of Whiskers Compared to the Box

We can see that the length of the upper whisker is less than 1.5 times the size of the box. This means that the end of the upper whisker is the maximum value (apart from any outliers), at about 20000. Similarly, the length of the lower whisker is less than 1.5 times the size of the box, so the end of the lower whisker is the minimum value, at about 0.

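    g. Outliers

These are values that lie more than 1.5 times the IQR beyond the box, i.e. below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. They are drawn as separate points beyond the whiskers. There is no outlier in June.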
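The points above can also be checked numerically. Here is a minimal sketch in Python, assuming the common 1.5 * IQR whisker rule described in this post. The function name box_plot_summary and the sample values are made up for illustration; they are not the data behind the plot above, and other tools may compute quartiles with a slightly different interpolation rule.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the numbers a box plot draws, using the 1.5 * IQR whisker rule."""
    data = sorted(values)
    # "inclusive" treats the data as the full population; this is an assumption.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical daily sales in Rs. (illustrative only)
daily_sales = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 30000]
summary = box_plot_summary(daily_sales)
print(summary)
```

In this made-up sample the last value lies beyond the upper fence, so it is reported as an outlier and the upper whisker stops at the largest remaining value.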

SUMMARY

The box plot analysis of daily sales data for the brand in June reveals several key insights. Sales per day range from 0 to 20,000 Rs., which is a wide range. The median sales per day stands at around 7,500 Rs., with approximately half of the days below this value and the other half above. Notably, there are more days with sales below 10,000 Rs. than above, indicating a distribution skewed towards lower sales. The interquartile range (IQR), representing the middle 50% of the data, spans from 5,000 to 12,000 Rs., which is relatively small compared to the overall range. This suggests the presence of more extreme sales values, particularly at the upper end. The absence of outliers suggests a consistent dataset. The upper whisker is longer than the lower whisker, indicating more extreme sales values at the higher end.

So the sales per day in June show high variability, with more extreme values towards the upper end of the data, indicating right-skewed data. This could have happened because of some event, possibly a discount sale.

2. Compare two or more box plots

To compare two box plots, you can visually analyze their key components and consider the following aspects:

Size: The size of the plot from whisker to whisker can be used for comparison. The larger it is, the more spread out the data in the plot.

Overlapping: Check if the boxes and whiskers of the two box plots overlap. If the boxes or whiskers overlap significantly, it suggests that the distributions of the two datasets have similarities in terms of central tendency and spread. On the other hand, if the boxes and whiskers do not overlap or have minimal overlap, it indicates potential differences between the distributions.

For example, comparing May and June, the boxes overlap significantly.

Median Comparison: Compare the positions of the medians in the two box plots. If one median is higher than the other, it suggests a difference in the central tendency of the two datasets. A higher median in one box plot indicates higher values or sales compared to the other dataset.

For example, comparing May and June, the medians are similar.

Quartiles: Examine the quartiles (Q1 and Q3) of the two box plots. If the two datasets have similar quartiles, it suggests similarities in the lower and upper ranges of the data. If the quartiles differ, it indicates differences in the spread of the data or the range of sales.

The quartiles of May and June are also similar.

Outliers: Pay attention to any outliers in the box plots. Compare the presence, position, and magnitude of outliers in each plot. Unusual outliers may indicate unique patterns or extreme values in one dataset compared to the other.

There is an outlier in May.

Overall Shape: Observe the overall shape of the box plots. If the boxes are similar in length, it suggests similar variability in the two datasets. If one box is longer than the other, it indicates a larger interquartile range, i.e. greater variability in the middle 50% of the corresponding dataset.
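The comparison aspects listed above can be roughly quantified as well. The sketch below is only one possible way to do it: the helper names box_stats and compare_boxes are hypothetical, and the two lists of values are invented for illustration rather than taken from the May and June data in the plot.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles and IQR of one dataset (inclusive interpolation, an assumption)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


def compare_boxes(first, second):
    """Compare the medians, box sizes and box overlap of two datasets."""
    a, b = box_stats(first), box_stats(second)
    # The boxes overlap when the higher Q1 is still below the lower Q3.
    overlap = min(a["q3"], b["q3"]) - max(a["q1"], b["q1"])
    return {
        "median_diff": b["median"] - a["median"],
        "iqr_first": a["iqr"],
        "iqr_second": b["iqr"],
        "boxes_overlap": overlap > 0,
        "overlap_length": max(overlap, 0),
    }


# Hypothetical daily sales in Rs. for two months (illustrative only)
month_a = [2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
month_b = [1000, 3000, 5000, 6000, 7000, 9000, 12000, 15000, 20000]
result = compare_boxes(month_a, month_b)
print(result)
```

A small median difference together with a large box overlap points to similar central tendency, while a clearly larger IQR in one month points to greater variability in that month.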

Conclusion

In comparing the box plots of May and June, their shapes are similar; however, there are more extreme values in June than in May.

Post Notes

Relation between Box Plots and Normal Distribution

If we are looking at the box plot of a normal distribution, the relationship is as follows:



Thus the edges of the "box" lie about 0.67 standard deviations on either side of the mean, and the whisker limits (1.5 times the IQR beyond the box) lie about 2.7 standard deviations on either side.
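These two figures can be reproduced with a short calculation. The sketch below uses Python's standard statistics.NormalDist for a standard normal distribution (mean 0, standard deviation 1) and assumes the usual 1.5 * IQR rule for the whisker limits; the variable names are chosen only for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# The box edges are the 25th and 75th percentiles.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# The whisker limits sit 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Share of a normal distribution that falls outside both limits.
share_outside = 2 * (1 - std_normal.cdf(upper_fence))

print("box edge (in std deviations):", round(q3, 2))
print("whisker limit (in std deviations):", round(upper_fence, 2))
print("share beyond the limits (%):", round(share_outside * 100, 2))
```

So for normally distributed data, well under 1% of the values would be expected to show up as outliers.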

