Columnar Storage: Impact in BI (Part 1)

en.wikipedia.org

There are two data storage methodologies when it comes to structured data and more specifically to Relational Database concept. They are “Row Store” and “Column Store” mechanisms, which are drastically different architectures of storing and retrieving data in a database management system. Row vs Column Store is a debate running for a long time and there are specific use cases for both mechanisms. Column store found its niche in large scale analytical applications and is being rapidly adopted in DW/BI space.

Columnar storage method has been in the market for quite some time.Row_Column_1 Let us take an example of a simple table with data in rows and columns. In a database management system, each row is identified by an unique value, RowID in this example. This is not visible directly to the user though, but acts like a pointer to the actual data location within the file system and is used for all DML tasks. The basic difference is how the data is organized within the persistent storage (i.e. non-volatile storage that does not lose the data when power goes down), which looks like below.

Row_Column_2It is a simple switch of how the data is organized and stored in underlying storage system. This method of grouping along columns inherently comes with its own set of advantages that gives an edge in large DW/BI landscapes.

Row_Column_3Compression: Unlike the example above, in today’s columnar storage repeated values are not stored. In the example above, the values for gender, department etc are repeated multiple times, for which only unique values are required.

Consider a scenario of census data where first names and last names of a large 100 million population needs to be recorded. Since names are not unique to every individual, one can store all unique names separately and link  the names to each individual. This reduces total memory required to store the same quantum of data and thus reducing data footprint.

Additionally, there are many techniques advanced compression techniques such as Run Length Encoding, Integer Packing, LZW etc. to reduce the amount of data stored for each column. Case studies show anywhere between 5x to 50x reduction in memory space required for the same quantum of data when compared to row store.

1 Trackback / Pingback

  1. Columnar Storage: Impact in BI (Part 2) | "Data" to "Wisdom" Journey…..

Comments are closed.