Data Reservoir and Data Lakes


These terms come up in almost every conversation and are a trending topic in the BI & Analytics space today. The buzzwords are fairly self-explanatory: both refer to vast quantities of data held in a single location, with volumes measured in terabytes and petabytes.

In traditional DW/BI implementations, one can see a parallel between a data lake and the staging layer of a data warehouse. Data entering the data warehouse is first stored in the staging layer in “as-is” format, without any conversions or business logic. ETL tools then apply the required logic and process the data across the different layers. Data in the staging layer stays in its original state, while new and updated data is periodically propagated by the ETL tools through to the final layer for consumption. The staging layer also does not enforce any schema on the data, and even where a schema needs to be defined, it is enforced minimally. In terms of their logical usage in an Enterprise Data Warehouse environment, a data lake and a staging layer serve essentially the same purpose.
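The staging-then-ETL flow described above can be illustrated with a minimal Python sketch. The field names (`order_id`, `amount`, `order_date`) and the helpers `transform` and `run_etl` are hypothetical and chosen only for illustration; a real ETL tool would do considerably more.

```python
from datetime import date

# Hypothetical staging area: raw rows kept exactly as received (all strings).
staging = [
    {"order_id": "1001", "amount": "19.90", "order_date": "2015-03-01"},
    {"order_id": "1002", "amount": "5.00", "order_date": "2015-03-02"},
]

def transform(raw):
    """Apply typing and business logic on the way to the next layer."""
    return {
        "order_id": int(raw["order_id"]),
        "amount": float(raw["amount"]),
        "order_date": date.fromisoformat(raw["order_date"]),
    }

def run_etl(staging_rows, already_loaded):
    """Propagate only rows not yet loaded; the staging rows are never modified."""
    return [
        transform(row)
        for row in staging_rows
        if row["order_id"] not in already_loaded
    ]

# Assume order 1001 was propagated in an earlier run.
warehouse_rows = run_etl(staging, already_loaded={"1001"})
```

The point of the sketch is that the staging rows stay untouched while typed, processed copies move downstream.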

The term “Data Lake” was coined after “Big Data” technologies exploded into mainstream IT, and it is most appropriate in distributed environments. Apart from volume, support for a wide variety of data formats (documents, images, videos, XML, JSON, etc.) is a key differentiator for data lakes. Since no structure is attached to the data in the lake, analytical applications must enforce a schema during data processing. A data lake architecture focuses only on data persistence; for example, a bare-bones Hadoop-based solution from companies such as Cloudera or Hortonworks can qualify as a “Data Lake”.
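Enforcing the schema at processing time is often called "schema on read". A minimal sketch of the idea, assuming the lake holds raw JSON lines and the consuming application decides which fields and types it cares about (the records, `SCHEMA`, and `read_with_schema` are all hypothetical):

```python
import json

# Hypothetical raw lake contents: stored as-is, no schema enforced on write.
raw_lake = [
    '{"user": "alice", "age": "34", "city": "Pune"}',
    '{"user": "bob", "age": 41}',
    '{"user": "carol", "extra": [1, 2, 3]}',
]

# Schema chosen by the consuming application: field name -> (type, default).
SCHEMA = {"user": (str, ""), "age": (int, None)}

def read_with_schema(line, schema):
    """Parse one raw record and project it onto the reader's schema."""
    record = json.loads(line)
    out = {}
    for field, (cast, default) in schema.items():
        value = record.get(field)
        out[field] = cast(value) if value is not None else default
    return out

rows = [read_with_schema(line, SCHEMA) for line in raw_lake]
```

Another application could read the same raw lines with a different schema, which is what keeps the lake itself free of structure.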

A data reservoir is similar to a data lake, but it tries to address shortcomings inherent in data lakes. Anyone with exposure to BI projects can relate to the saying “Data is never clean”. A big chunk of any BI project is spent manipulating data to make it suitable for consumption. These tasks are usually monotonous, and when they are implemented at the data lake itself, a data reservoir takes shape.
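As a rough illustration of moving routine clean-up into the lake itself, the sketch below cleanses raw records once so that every consumer can reuse the result. The specific rules (trimming whitespace, lower-casing field names, dropping exact duplicates) and the function names are assumptions made for the example, not a definition of what a data reservoir must do.

```python
def cleanse(record):
    """Routine clean-up: trim whitespace, lower-case keys, empty strings become None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[key.strip().lower()] = value
    return cleaned

def build_reservoir(raw_records):
    """Cleanse each raw record and drop exact duplicates, preserving order."""
    seen = set()
    reservoir = []
    for record in raw_records:
        cleaned = cleanse(record)
        fingerprint = tuple(sorted(cleaned.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            reservoir.append(cleaned)
    return reservoir

# Hypothetical raw records as they might land in the lake.
raw = [
    {"Name ": " Alice ", "City": "Pune"},
    {"name": "Alice", "city": "Pune"},
    {"name": "Bob", "city": "  "},
]
reservoir = build_reservoir(raw)
```

Here the first two raw records collapse into one after cleansing, so downstream BI work does not have to repeat that step.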


Big Data is still the new kid on the block and is deployed only for very specialized applications in the industry. We can look at data reservoirs as Data Lakes version 2.0. Innovations are flooding the market, and the characteristics of data lakes and reservoirs are in a constant state of change, with new architectures addressing even more requirements from the industry.