Partitioning, as the term implies is a divide an conquer method of handling data. It is one of the most used performance improvement measures in a data warehouse. Data as a whole, logically appears to be grouped as a single dataset, but technically is managed as multiple independent chunks. This method of managing data has been very successful and found its way as default option in many latest data technologies such as MPP systems, NoSQL databases to name a few.
The predominant type of partitioning employed is referred to as “horizontal” partitioning, wherein the data is broken down into manageable pieces based on a pre-defined criteria. The table lists out most popular partitioning methods in relational database scenarios and big-data tools use similar approach with modifications to the methods used in traditional relational database environment e.g. Hadoop uses Key-Value pair based partitioning which is a variation of hash partitioning.
|List||Static List provided with values, based on which the number of partitions are managed. Data that does not match the list are stored separately.||Year (2015, 2016 etc)|
Location (Americas, EMEA, APJ etc)
|Number of records matching criteria defined|
|Range||Static List provides with from and to values. Number of partitions depend on the list size and default range is specified for data that does not match criteria.||Year (2000 - 2005, 2006 - 2010, 2011 - 2099 etc)||Number of records matching criteria defined|
|Hash||Internal algorithm decides on partition. Ensures that all partitions are of even size.||Equal size by algorithm|
|Round Robin||Similar to hash partition in all aspects except that the algorithm uses sequential method to manage partitions||Equal size by algorithm|
|Composite||Mix of two partioning methods. First method will act as primary and within the primary partition, the secondary method is used.||Range List|
Primary: Year (2000 - 2005, 2006 - 2010, 2011 - 2099 etc)
Secondary: Location (Americas, EMEA, APJ etc)
All tables are not partitioned by default because there are overheads related to data storage and manipulation of partitioned data. Tables that cross millions of records and used extensively in reporting are only considered as candidates for partition. Systems also have an upper limit to size of data in the table, such as record count or memory size before enforcing partition (e.g. SAP HANA recommends 2billion records as the upper limit)
The partition function and scheme are designed based on careful analysis of source data flowing from the ETL pipeline and SQL execution plan analysis of data fetches for reporting tools before finalizing the fields/columns that will act as the base criteria for partition. Creating, changing or removing a partition is one way street and requires a full drop and recreate of the actual target table as well as full data re-load. Once data is populated, the performance in reporting side makes sizeable improvements due to decreased latency in data fetch. Another advantage to partitioning is that one can perform certain database level operations such as TRUNCATE a specific partition without impacting data in others
Predominant method of partition is a table level physical partition, which creates and maintains the number of partitions based on function and scheme the moment a table is created and associated indices. Another method of partitioning is via Indexes on Tables rather than the table itself. This is a different approach compared physical partition because only a partitioned index is created without physically splitting the table. This is very useful when a table cannot be re-created and data re-populated, but in general it is not common practice.
All partitioned tables will have the partition criteria included within indices in order to make the seek and fetch faster. Creating partitions is the first step, but in order to maintain efficiency one must monitor partitions regularly and prune partitions in order to keep the size of data in check.