A common area of interest today, covered by several analyst articles, is the concept of a “Virtual Data Warehouse”, also referred to as a “Logical Data Warehouse”. It is often described as the next logical step in the evolution of data warehouse environments (DWE).
Conceptually, this is an extension of “Data Virtualization” from a technical point of view. A traditional data warehouse environment relies extensively on ETL tools to fetch data and update the warehouse constantly. Now consider a data warehouse platform that holds only the metadata layer (i.e. definitions of data) without holding any actual data. Through this metadata layer, developers and users alike have access to enterprise-wide data irrespective of the source system or technology stack. Developers design and publish data models and reports for different requirements, while users consume the data. The virtual metadata layer takes care of fetching the data, applying business rules, and processing it for the final reports.
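The idea of a metadata-only layer can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the source fetcher, table names, and the tax rule are all invented for the example. The key point is that the `VirtualTable` stores only *how* to get and transform the data, never the data itself.

```python
# A minimal sketch of a virtual "meta data layer": it stores only
# definitions (which source holds which data, plus business rules) and
# fetches/transforms rows on demand. All names here are illustrative.

def fetch_crm_orders():
    # Stand-in for a live query against a CRM source system.
    return [{"order_id": 1, "amount": 120.0}, {"order_id": 2, "amount": 80.0}]

class VirtualTable:
    def __init__(self, fetcher, rules):
        self.fetcher = fetcher  # how to get the data (no data is stored)
        self.rules = rules      # business rules applied at query time

    def query(self):
        rows = self.fetcher()
        for rule in self.rules:
            rows = [rule(row) for row in rows]
        return rows

# Business rule applied at query time instead of being stored as a column.
add_tax = lambda r: {**r, "amount_with_tax": round(r["amount"] * 1.1, 2)}

orders = VirtualTable(fetch_crm_orders, [add_tax])
print(orders.query())
```

Every `query()` call reaches back to the source and re-applies the rules, which is exactly the trade-off the rest of this article examines.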
Some key items to keep in mind while taking the “virtual” route to a data warehouse implementation:
- Push-Down Operations – Processing logic cannot always be delegated to the source systems. In many instances, partial results have to be collated and processed within the virtualization layer. Consider an example where the top 10 products for each of the top 10 locations are required. The measures can only be aggregated at the source system level, whereas the sorting and ranking operations are done in the virtualization layer.
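The split between pushed-down aggregation and layer-side ranking can be sketched as follows. The two source dictionaries stand in for partial aggregates returned by two (invented) source systems; merging and ranking happen in the virtualization layer.

```python
# Sketch of push-down with collation: each source aggregates its own
# sales (pushed down), then the virtualization layer merges the partial
# results and performs the ranking. All figures below are made up.

from collections import defaultdict

# Partial aggregates as they would come back from two source systems.
source_a = {("ProductX", "Berlin"): 500, ("ProductY", "Berlin"): 300}
source_b = {("ProductX", "Pune"): 400, ("ProductZ", "Pune"): 700}

def merge_partials(*partials):
    totals = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    return totals

def top_n(totals, n):
    # Sorting/ranking happens in the virtualization layer, not at source.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

totals = merge_partials(source_a, source_b)
print(top_n(totals, 2))
```

Note that the ranking step must see *all* partial results before it can emit anything, which is why it cannot itself be pushed down to any single source.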
- Data Quality – Traditional data warehouse environments strive to be the “single source of truth” by implementing several data quality initiatives. This is largely non-existent in virtual data warehouse scenarios. One has to either embed the entire data quality logic within the metadata layer or ensure that each source system is governed very well. A simple example: two source systems use two date formats, ISO (YYYYMMDD) and US (MMDDYYYY), but the report requires the Indian format (DDMMYYYY). Here the dates cannot be harmonized once into a single standardized format; instead, the conversion has to be performed on every query.
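The date example can be made concrete. The source names (`erp`, `crm`) are hypothetical; the point is that the conversion runs per row on every query instead of once during ETL.

```python
# Sketch: harmonizing date formats at query time. Two sources emit
# ISO (YYYYMMDD) and US (MMDDYYYY); the report wants DDMMYYYY. In a
# virtual setup this conversion runs on every row of every query.

from datetime import datetime

SOURCE_FORMATS = {"erp": "%Y%m%d", "crm": "%m%d%Y"}  # illustrative sources

def to_report_date(raw, source):
    parsed = datetime.strptime(raw, SOURCE_FORMATS[source])
    return parsed.strftime("%d%m%Y")  # Indian format DDMMYYYY

print(to_report_date("20240315", "erp"))  # 15032024
print(to_report_date("03152024", "crm"))  # 15032024
```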
- Security – A heavy dependency is created on every source system involved in the data warehouse environment. Data security can be implemented at different levels for the same user across different source systems. In some instances, DW users will not have access to all source systems, e.g. an Accounting Executive vs. a Reporting Analyst accessing a financial system. Data security then has to be implemented wholly in the metadata layer, but when it comes to synchronizing with the source systems, the complexity increases exponentially. (Note: some applications today bypass this by using a generic user ID as the background connection, with full access to each system's back-end data.)
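The generic-background-connection pattern mentioned in the note can be sketched as a per-role row filter in the metadata layer. The roles and the ledger-based rule are purely illustrative; real implementations would synchronize these rules with each source system's authorizations.

```python
# Sketch: security enforced in the meta data layer. A generic background
# user reads everything from the source; per-role row filters are then
# applied in the layer. Roles and rows here are invented for illustration.

ROW_FILTERS = {
    "reporting_analyst": lambda row: True,                  # full access
    "accounting_exec": lambda row: row["ledger"] == "GL",   # restricted
}

def secure_query(rows, role):
    allowed = ROW_FILTERS[role]
    return [row for row in rows if allowed(row)]

rows = [{"ledger": "GL", "amount": 10}, {"ledger": "AP", "amount": 20}]
print(secure_query(rows, "accounting_exec"))  # only the GL row
```

The exponential complexity the text mentions comes from keeping `ROW_FILTERS` consistent with what each user is actually allowed to see in every source system.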
- Business Logic – Any piece of code written, irrespective of size and complexity, is executed every time at run time. This additional logic adds to the overall processing overhead. If requirements are complex and data volumes are large, it will drain the system and degrade performance drastically. If the business logic implementation is instead moved to the source systems, the overhead is merely passed on to the transactional systems. Irrespective of the complexity of the task, e.g. conversion to upper case vs. a lookup against cross-reference tables, every line of code included to implement business logic has an adverse effect on performance.
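The upper-case vs. cross-reference contrast can be shown as two rules applied per row on every query pass. The cross-reference table and column names are invented; what matters is that both rules, cheap and expensive alike, run at read time rather than once during load.

```python
# Sketch: every business rule runs per row at query time. The two rules
# mirror the contrast in the text: a cheap upper-casing vs. a lookup
# against a cross-reference table. All data here is illustrative.

CROSS_REF = {"DE": "Germany", "IN": "India"}  # cross-reference table

def apply_rules(rows, rules):
    for rule in rules:
        rows = [rule(row) for row in rows]
    return rows

upper_name = lambda r: {**r, "name": r["name"].upper()}           # cheap
resolve_country = lambda r: {**r, "country": CROSS_REF[r["cc"]]}  # lookup

rows = [{"name": "acme", "cc": "DE"}, {"name": "zeta", "cc": "IN"}]
print(apply_rules(rows, [upper_name, resolve_country]))
```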
- Historical Data – A traditional data warehouse does not retire data; all historical data is retained for analytical needs. Transaction processing systems, on the contrary, archive old data and keep only the “near past” in order to maintain their own performance. The scope of data retrieved by an analytical query cannot be predicted, due to the nature of complex reporting requirements. If the source systems do not maintain full history, this directly affects the usability of the data warehouse.
- Performance – Data warehouses have very large data volumes by design, on the order of a few million to a couple of hundred million records. To keep performance acceptable, summary tables are built on top of the warehouse and queried first; based on user needs, the detailed data in the warehouse is then accessed. It should not be a surprise that, in the real world, around 80% of analytical queries access the detailed data. Working with very large data sets in a virtualized environment magnifies all of the issues mentioned above.
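The summary-first access pattern can be sketched as aggregate-aware routing: a query is answered from the summary table only if every column it asks for exists at the summary grain. Table contents and grains below are invented for illustration.

```python
# Sketch: aggregate-aware routing. Queries that fit the summary grain
# hit the small summary table; anything finer falls through to detail.

SUMMARY_GRAIN = {"year", "region"}  # columns available in the summary

summary = [{"year": 2024, "region": "EU", "sales": 1500}]
detail = [
    {"year": 2024, "region": "EU", "store": "S1", "sales": 900},
    {"year": 2024, "region": "EU", "store": "S2", "sales": 600},
]

def route(query_columns):
    # Use the summary only when every requested non-measure column
    # exists at the summary grain; otherwise fall back to detail.
    if set(query_columns) - {"sales"} <= SUMMARY_GRAIN:
        return "summary", summary
    return "detail", detail

print(route(["year", "region", "sales"])[0])  # summary
print(route(["year", "store", "sales"])[0])   # detail
```

The article's 80% figure means the `detail` branch dominates in practice, which is where a purely virtual layer struggles most.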
Conclusion: the Virtual Data Warehouse is not yet a mature technology for adoption in today's scenario. Hardware and application technology have progressed, but several tangible and tacit areas, such as data quality, real-time reporting, data normalization, business rule logic, changing dimensions, and being the single source of truth for reporting, to name a few, still have to be addressed. A hybrid approach is feasible, and it is the path enterprise BI is currently taking. In the hybrid approach, the data lineage and business logic are analyzed thoroughly and each data set is classified as virtual or materialized. Data to be materialized is replicated using ETL methods, while virtual data is accessed on demand at the source.
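The hybrid classification step can be sketched as a simple decision rule. The criteria (full history needed, transformation complexity) and thresholds below are illustrative assumptions, not a prescribed methodology.

```python
# Sketch of the hybrid classification step: each data set is marked
# "materialize" or "virtualize" based on simple, invented criteria
# (full history needed or heavy rule logic => materialize via ETL).

def classify(dataset):
    if dataset["needs_history"] or dataset["rule_complexity"] > 5:
        return "materialize"  # replicate via ETL
    return "virtualize"       # access directly at source

datasets = [
    {"name": "finance_gl", "needs_history": True, "rule_complexity": 8},
    {"name": "live_inventory", "needs_history": False, "rule_complexity": 2},
]
print({d["name"]: classify(d) for d in datasets})
```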
SAP HANA provides an option to synchronize with other databases via data virtualization using its Smart Data Access interface. SAP BW provides Open ODS views, a virtual definition without actual data. SAP BusinessObjects implements this concept through the Information Design Tool, which acts as the metadata layer. All of these tools are in production environments today, albeit with limited scope.