Very Large DataSet Loading

Data warehouses are designed to handle large volumes of data. Two loading mechanisms are typically adopted: full load and incremental load. Neither has a significant performance impact on a fact table when data volumes are low. However, when total loading time exceeds 12 hours, the ETL batch job starts to interfere with reporting. Data availability for reporting is a major criterion for successful BI adoption.

Very large dataset loading scenarios are typically encountered in the reporting layer, where a truncate-and-load is sometimes required for a fresh load or a rebuild of data. In such circumstances, the ETL data flow cannot insert into or update the target table directly, and the design must guarantee 100% data availability for reporting. The following loading types are used to circumvent this problem.

Partition Based: In a physically partitioned environment, databases allow operations to be performed independently at the table-partition level. For example, if a table is partitioned into three zones, e.g. APJ, EMEA and AMERICAS, the ETL data flow can be designed to process data in only one partition at a time. This method is not foolproof and is therefore adopted only in scenarios where a specific requirement calls for this kind of processing.
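As a minimal sketch of the idea, the following uses SQLite, which has no native partitioning, so a zone column with a WHERE clause stands in for a real table partition; the table and column names (`sales_fact`, `zone`, `amount`) are illustrative, not from the original text. Each zone's slice is rebuilt in its own transaction, so the other zones remain queryable throughout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (zone TEXT, amount REAL)")

def reload_zone(conn, zone, amounts):
    """Truncate and reload a single zone's slice only.

    Runs in one transaction, so readers either see the old slice
    or the new one, and the other zones are never touched.
    """
    with conn:
        conn.execute("DELETE FROM sales_fact WHERE zone = ?", (zone,))
        conn.executemany(
            "INSERT INTO sales_fact (zone, amount) VALUES (?, ?)",
            [(zone, amt) for amt in amounts],
        )

# Process one "partition" at a time, as the text describes.
reload_zone(conn, "EMEA", [100.0, 250.0])
reload_zone(conn, "APJ", [75.0])
```

In a database with real partitioning (e.g. Oracle), the same pattern is usually implemented with partition truncate or partition exchange rather than a DELETE.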

Schema Based: This method is prominent where the reporting layer is defined as an independent schema and data is loaded into a different schema. All reports point to objects in the reporting schema, which always resolves to the underlying schema where the most recent data resides. Two identical schemas with tables of the same structure are maintained, transparent to the reporting layer. The ETL process loads data into the underlying schemas, which are switched after each load.

In this method, one schema contains the latest data and the other lags by one load. After the data load completes, a sync-up job can bring both schemas into step, but this is not mandatory. The swap can also be implemented at table level by quickly renaming the tables, although the schema-switching method is more widely adopted.
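The table-level rename variant can be sketched as follows, again using SQLite for illustration; SQLite has no schemas to switch, so the swap is shown via `ALTER TABLE ... RENAME`, and the table names (`report_sales`, `report_sales_stage`) are hypothetical. The ETL loads into the staging copy while reports keep reading the live copy; the renames happen in one brief transaction, so reporting never sees an empty or half-loaded table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE report_sales (id INTEGER, amount REAL);        -- live copy reports read
CREATE TABLE report_sales_stage (id INTEGER, amount REAL);  -- ETL loads here
""")
conn.execute("INSERT INTO report_sales VALUES (1, 10.0)")  # previous load's data

def swap_in_fresh_load(conn, rows):
    """Load the staging table, then swap it in by renaming."""
    conn.executemany("INSERT INTO report_sales_stage VALUES (?, ?)", rows)
    with conn:  # brief atomic swap
        conn.execute("ALTER TABLE report_sales RENAME TO report_sales_old")
        conn.execute("ALTER TABLE report_sales_stage RENAME TO report_sales")
        conn.execute("ALTER TABLE report_sales_old RENAME TO report_sales_stage")
        # Clear the demoted copy so it is ready for the next truncate-and-load;
        # skipping this leaves it lagging by one load, as the text notes.
        conn.execute("DELETE FROM report_sales_stage")

swap_in_fresh_load(conn, [(1, 11.0), (2, 12.0)])
```

In a real warehouse the same effect is achieved by repointing synonyms or switching the schema search path, which avoids touching the tables at all.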



The base assumption is that the underlying data model has no table-dependent objects such as triggers or foreign key constraints. In both cases, database statistics need to be rebuilt after the load to maintain query performance, irrespective of whether the table swap or schema swap method is adopted.
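The statistics rebuild itself is a single statement in most databases (ANALYZE in SQLite and PostgreSQL, DBMS_STATS in Oracle). A minimal SQLite illustration, with an assumed table and index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (k INTEGER PRIMARY KEY, v REAL)")
conn.execute("CREATE INDEX idx_v ON fact(v)")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1000)],
)

# After a bulk load or swap, rebuild optimizer statistics so the
# query planner works from current row counts and distributions.
conn.execute("ANALYZE")

# SQLite stores the gathered statistics in sqlite_stat1.
stats = conn.execute("SELECT tbl, idx FROM sqlite_stat1").fetchall()
```

Scheduling this as the final step of the ETL batch, immediately after the swap, keeps the first reporting queries from running against stale statistics.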


SAP BW provides a process type in the DTP to delete overlapping requests, which achieves similar functionality.