“Mean, Median and Mode” is the first topic that anyone encounters when they start learning Statistics. Within BI context, these three methods of measuring numerical data are used extensively but have a slightly different perspective. Although all measures are equally significant, there are some instances wherein one has to be careful because interpretation can be confusing and hence misleading.
Definitions for Mean, Median & Mode are as follows
- Mean is the average of all measure values within a data set.
- Median is the central number that splits a data set into two equal parts based on a specific measure.
- Mode represents the measure that occurs maximum number of times within a data set.
Here, the data set implies a combination of dimensions and a single measure from fact table with filters applicable according to analysis context. Another thing to be noted is that measures refer to cumulative fact data only. Non-additive or semi-additive facts are not good candidates for performing these type of calculations.
Definitions of Mean and Mode are straight forward and signifies a single value for entire data set, whereas Median indicates a center point value that divides the data set into two equal parts.
Mean and Median are both considered center point measures and there are instances where Mean and Median will be equal, but in many cases they are not. In spite of being Center Point measures, the meaning conveyed by Mean and Median are vastly different.
Consider an example of different work orders and time taken for each work order to be completed. A quick glance at the data will not show anything out of ordinary, but in reality this is a skewed data. In this example, the Mean is 27.7 whereas Median is 14.5 and Mode is 14. This essentially says that the average time taken to close a Work Order is 27.7 minutes and out of 20 work orders 10 work orders were closed under 14.5 minutes and remaining took more than 14.5 minutes and the maximum number of Work Orders that were closed by duration took 14 minutes.
This example illustrates the glaring gap between different center point values, which essentially is due to skewed data. In real life situations, a perfect symmetrical data is hard to find and all data sets are skewed either to left or right. Predominant contributor to data skewness are outliers within data set, which makes the data representation with either a long or short tail.
In the example above, one Work Order took 110 minutes to be resolved, which is an outlier among this data set. In cases of skewed data values for Mean, Mode and Median will not match. Representing a KPI in a dashboard should keep this in mind because user interpretation will not be uniform across the organizations. Human brains are hard wired to think in a symmetry and it is easy for end users to perceive wrong values if UI is incorrect.
From SQL perspective, we can use AVERAGE function to calculate Mean and COUNT to find Mode along with respective GROUP BY clause. Arriving at MEDIAN is slightly tricky because a direct SQL function is not available, instead the keyword TOP along with PERCENT to arrive at the Median value.
Median should be used only when user is trying to analyze the data and trying to perform “what-if” analysis. It is more of an “analytic” type of measure whereas Mean and Mode are a bit straightforward.
For more information