Columnar Storage: Impact in BI (Part 2)

-Link to series: Part 1

Seek Time: This property improves performance primarily during read operations. A column-based mechanism stores all values of a column in contiguous locations, whether on disk or in memory, which enables the values of an entire column to be read in one pass. Database management systems manage data in blocks, which are typically a few kilobytes in size (4 KB is a common figure). When the data sits in contiguous blocks rather than in randomly scattered blocks, the system only has to locate the starting point of the operation. From that starting point and the total amount of data stored, the whole column can be read sequentially in a single pass, which is what makes the operation fast. (This applies to read operations; insert and update operations are handled in an entirely different manner.)
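
To put rough numbers on the contiguous-storage argument, here is a minimal sketch that counts how many blocks hold values of a single column under each layout. The 8-byte fixed-width values, the function name and the table size are assumptions made for illustration; real engines add headers, compression and variable-width data.

```python
import math

BLOCK_SIZE = 4096   # bytes per block (assumed, matching the 4 KB figure)
VALUE_SIZE = 8      # bytes per value (assumed fixed width, no compression)


def blocks_for_column_scan(num_rows, num_columns, layout):
    """Count the blocks that must be read to scan ONE column of the table."""
    if layout == "row":
        # Each row carries a value of the column, so the column is spread
        # over every block the table occupies.
        total_bytes = num_rows * num_columns * VALUE_SIZE
    elif layout == "column":
        # The column's values sit back to back in their own run of blocks.
        total_bytes = num_rows * VALUE_SIZE
    else:
        raise ValueError("layout must be 'row' or 'column'")
    return math.ceil(total_bytes / BLOCK_SIZE)


row_blocks = blocks_for_column_scan(1_000_000, 10, "row")
column_blocks = blocks_for_column_scan(1_000_000, 10, "column")
```

Under these assumptions, scanning one column of a ten-column table touches about a tenth of the blocks in the column layout, and those blocks form one contiguous run.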

Data Transfer: This advantage comes from the combination of OLAP query patterns and the columnar storage layout. An analytical query typically selects only a subset of the columns in a table, namely those relevant to the analysis at hand. The requested data can either be processed within the database or transferred to an application server for further manipulation. Consider a report that requests two columns from a ten-column table and aggregates them. In a row-based mechanism, the system has to fetch all ten columns and then discard the eight unwanted ones before processing the data, whereas a column layout reads only the two requested columns and skips that step. Since analytical queries are often ad hoc in nature, this inherent capability, coupled with the reduced seek time, provides significant performance gains.
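
The two-out-of-ten scenario can be sketched with plain Python structures. The table contents, the column names c0 to c9 and both helper functions are hypothetical; the point is only to count how many values each layout has to touch for the same aggregation.

```python
COLUMN_NAMES = [f"c{i}" for i in range(10)]   # hypothetical ten-column table

# Row layout: one tuple per record.
row_store = [tuple(r * 10 + i for i in range(10)) for r in range(1000)]

# Column layout: one list per column, holding the same data.
column_store = {
    name: [row[i] for row in row_store] for i, name in enumerate(COLUMN_NAMES)
}


def sum_from_rows(rows, wanted):
    """Aggregate the wanted columns from a row layout; count values fetched."""
    positions = [COLUMN_NAMES.index(name) for name in wanted]
    totals = dict.fromkeys(wanted, 0)
    touched = 0
    for row in rows:
        touched += len(row)                   # the whole row is fetched
        for name, pos in zip(wanted, positions):
            totals[name] += row[pos]          # the other values are discarded
    return totals, touched


def sum_from_columns(columns, wanted):
    """Aggregate the wanted columns from a column layout; count values fetched."""
    totals = {}
    touched = 0
    for name in wanted:
        values = columns[name]                # only the requested column is read
        touched += len(values)
        totals[name] = sum(values)
    return totals, touched


row_totals, row_touched = sum_from_rows(row_store, ["c2", "c7"])
col_totals, col_touched = sum_from_columns(column_store, ["c2", "c7"])
```

Both calls return the same totals, but the row layout touches all 10,000 values while the column layout touches only the 2,000 that belong to the two requested columns.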

Indexing: In a BI environment, the costliest operation in terms of time consumed is a “Full Table Scan”. In a row-store format, every row has to be read sequentially in order to fulfill the request. Row-store databases provide the option to create indexes that serve as an initial lookup point, where the system finds pointers to the RowIDs of the rows to be fetched. These database objects are created and maintained in addition to the actual data table. Performance tuning activities include building indexes on columns that are used frequently in analysis. Indexes also add overhead whenever data is inserted or updated in the table, which is why many ETL data loads drop indexes before loading data.
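
A toy model of a row store with one secondary index shows both sides of this trade-off. The class, its method names and the sample rows are hypothetical, and a dictionary stands in for the tree structures that real databases use.

```python
from collections import defaultdict


class IndexedRowTable:
    """Toy row store with a single secondary index (illustrative only)."""

    def __init__(self, indexed_field):
        self.rows = []                     # the RowID is the position in this list
        self.indexed_field = indexed_field
        self.index = defaultdict(list)     # value -> list of RowIDs

    def insert(self, row):
        row_id = len(self.rows)
        self.rows.append(row)
        # Extra work on every insert: the index has to be kept in step.
        self.index[row[self.indexed_field]].append(row_id)
        return row_id

    def full_scan(self, value):
        # Without the index, every row is inspected.
        return [r for r in self.rows if r[self.indexed_field] == value]

    def index_lookup(self, value):
        # With the index, only the RowIDs it points to are fetched.
        return [self.rows[i] for i in self.index.get(value, [])]


table = IndexedRowTable("region")
table.insert({"region": "EU", "sales": 120})
table.insert({"region": "US", "sales": 75})
table.insert({"region": "EU", "sales": 40})
eu_rows = table.index_lookup("EU")
```

The lookup returns the same rows as the full scan while touching only the matching RowIDs; the price is the additional bookkeeping inside `insert`.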

In column-store format, the additional index is largely unnecessary because each column acts as its own index. The data of each column is stored along with RowIDs because the system has to reconstruct the record (tuple) after fetching the relevant data. This assignment of RowIDs to data is managed internally. To speed up query execution further, column databases provide additional indexes, which compress well and work well on large datasets. These indexes can be either clustered or non-clustered, and multiple indexes can be defined on a column table, much like standard relational functionality.
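
Record reconstruction can be sketched as follows. The assumption here is that the RowID is simply the position of a value within its column, which is one possible way for an engine to manage the assignment internally; the column names and data are made up.

```python
# Each column is stored separately; position i in every list belongs to RowID i
# (an assumed, simplified way of tying values to RowIDs).
store = {
    "region": ["EU", "US", "EU"],
    "sales": [120, 75, 40],
    "year": [2012, 2012, 2013],
}


def reconstruct(store, row_id, wanted):
    """Rebuild one record (tuple) by reading the same RowID from each column."""
    return {name: store[name][row_id] for name in wanted}


record = reconstruct(store, 2, ["region", "sales"])
```

Only the columns named in `wanted` are read, so a record is stitched together from exactly the data the query asked for.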

Filter Conditions: In a typical analytical scenario, a user triggers multiple queries on the data warehouse within a short time span. This is normal, because users slice and dice data at various levels of granularity in quick succession. The WHERE clause(s) of an SQL query are executed in a single operation because the entire data set acts as an index by itself. Only the records for the matching RowIDs are reconstructed, which helps reduce memory and processing bottlenecks.
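
A minimal sketch of this filter-then-reconstruct flow, again assuming that a value's position in its column serves as its RowID. The `where` and `select` helpers and the sample data are hypothetical and do not reflect any particular engine's API.

```python
store = {
    "region": ["EU", "US", "EU", "US"],
    "sales": [120, 75, 40, 300],
    "year": [2012, 2012, 2013, 2013],
}


def where(column, predicate):
    """Scan a single column and return the RowIDs whose value matches."""
    return {row_id for row_id, value in enumerate(column) if predicate(value)}


def select(store, wanted, filters):
    """Apply each filter to its own column, then rebuild only matching records."""
    num_rows = len(next(iter(store.values())))
    matching = set(range(num_rows))          # no filters means every RowID
    for name, predicate in filters.items():
        matching &= where(store[name], predicate)
    return [
        {name: store[name][row_id] for name in wanted}
        for row_id in sorted(matching)
    ]


result = select(
    store,
    wanted=["sales"],
    filters={"region": lambda v: v == "EU", "year": lambda v: v == 2012},
)
```

Each condition is checked against one column only, the sets of RowIDs are intersected, and just the surviving records are rebuilt.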