Data Binning

Data binning is a simple yet powerful concept used in data analysis: it categorizes data into different buckets, or bins. One important point to remember is that “binning” and “clustering” are not the same; they differ considerably in logic and implementation.

Binning is a quick and dirty way to start analysis on a data set. It applies only to numeric data and can be implemented on either dimension or fact data. Classifying customers by age or income bracket is an example of binning on a dimension; analyzing sales order values by quantity or value range is an example of binning on fact data.
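To make the dimension example concrete, here is a minimal sketch in Python that assigns customers to age brackets. The bracket boundaries, labels and sample ages are made up for illustration; they are not taken from any real data set.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical age brackets (illustrative only). Each edge is the lowest
# age that belongs to the next bracket.
AGE_EDGES = [18, 30, 45, 60]
AGE_LABELS = ["<18", "18-29", "30-44", "45-59", "60+"]


def age_bin(age):
    """Return the label of the bracket that contains `age`."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


# Made-up customer ages.
ages = [17, 18, 25, 30, 44, 45, 59, 60, 72]

# Number of customers per bracket.
counts = Counter(age_bin(a) for a in ages)

for label in AGE_LABELS:
    print(label, counts[label])
```

The same function would work for fact data such as order values; only the edges and labels change.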

The result of a binning operation is always the count of records in each defined bucket, which helps in analyzing the distribution of a numeric data set. Based on that distribution, we can decide which ranges should constitute the buckets/bins. Bins can be defined so that each holds an equal share of records, based on percentile values in the distribution, or the ranges can be set arbitrarily.
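The difference between percentile-based and arbitrary ranges can be sketched as follows. This is only an illustration: the order values and the fixed edges (50 and 100) are invented, and quartiles are used as one possible choice of percentiles.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Made-up sales order values.
order_values = [12, 15, 18, 22, 25, 31, 40, 55, 80, 120, 200, 450]


def bin_counts(values, edges):
    """Count how many values fall into each bin delimited by sorted `edges`."""
    counts = Counter(bisect_right(edges, v) for v in values)
    # len(edges) cut points produce len(edges) + 1 bins.
    return [counts[i] for i in range(len(edges) + 1)]


# Percentile-based bins: the three quartile cut points split the data
# into four bins of roughly equal size.
quartile_edges = quantiles(order_values, n=4, method="inclusive")
by_quartile = bin_counts(order_values, quartile_edges)

# Arbitrary bins: ranges chosen by the analyst (hypothetical values).
fixed_edges = [50, 100]
by_fixed = bin_counts(order_values, fixed_edges)

print(quartile_edges, by_quartile)
print(fixed_edges, by_fixed)
```

With this sample, the quartile bins hold three orders each, while the fixed ranges hold seven, two and three orders, showing how skewed the values are.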

Binning is always a database-delegated operation because of the way it is used and applied. Consider a scenario where customer salary is binned into ranges, similar to the example above. Here, each customer must first be allocated to a bin before the results are aggregated at the reporting layer. Delegating to the database applies to both dimension-based and fact-based binning.
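A rough sketch of what delegating to the database can look like, using an in-memory SQLite database from Python. The table name, salary thresholds and bin labels are assumptions made for this example; a real reporting tool would generate its own SQL.

```python
import sqlite3

# In-memory database with a hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, salary REAL)")
conn.executemany(
    "INSERT INTO customer (salary) VALUES (?)",
    [(s,) for s in [18000, 32000, 41000, 58000, 75000, 120000]],
)

# The CASE expression allocates each customer row to a bin inside the
# database; GROUP BY then aggregates per bin. Only the small aggregated
# result travels to the reporting layer.
rows = conn.execute(
    """
    SELECT CASE
             WHEN salary < 30000 THEN 'low'
             WHEN salary < 70000 THEN 'mid'
             ELSE 'high'
           END AS salary_bin,
           COUNT(*) AS customers
    FROM customer
    GROUP BY salary_bin
    ORDER BY MIN(salary)
    """
).fetchall()

for salary_bin, customers in rows:
    print(salary_bin, customers)

conn.close()
```

The point of the design is the order of operations: row-level bin assignment happens first, in the database, and aggregation happens on the already binned rows.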

Binning ranges can be defined beforehand and used in analysis, or computed at run time. Tools on the market today provide options to perform on-the-fly binning of data. Such analysis is done extensively by sales and marketing departments to understand customers and to apply that understanding in the segmentation, targeting and positioning of a product.

A couple of earlier posts have already introduced this concept, although in a subtle manner.