Data Ingestion – Quick Intro

“Data Ingestion” is a term that has gained popularity in recent years, especially in the context of “Big Data”. Data Ingestion tools act as the primary interface for populating a data lake (e.g. a Hadoop cluster). The Data Ingestion process has two distinct functions: the first is connecting to various data sources and fetching data from them, and the second is automating many routine data manipulation tasks. Essentially, the Data Ingestion process keeps the data up to date for analysis.

A couple of popular tools are Apache Kafka and Apache Sqoop, and many more are available in the market.

From a BI perspective, Data Ingestion can be seen as equivalent to the tasks performed by an Extract, Transform and Load (ETL) tool. One can easily draw a parallel between ETL and Data Ingestion, but there are differences between the two, even though both concepts deal with raw data acquisition and processing.

ETL tools have their origins in structured reporting scenarios, where they predominantly focused on relational databases and files as data sources. ETL tools took a structured approach to the data pipeline for data warehouses. As the technology matured, ETL tools evolved to provide a variety of options for data acquisition and processing, but at their core they are still seen as an approach for structured reporting scenarios.

Data Ingestion, on the other hand, was a need that came out of “Big Data” technologies. In an unstructured and unpredictable environment, ETL tools could not bridge the gap of loading vast quantities of data. A typical case in point is data pertaining to IoT, where large volumes of data are generated every second. The concept of Data Ingestion turned the tables and addressed the need for processing voluminous, unstructured and format-insensitive data. The architecture of Data Ingestion tools combines features seen in middleware and ETL tools. One main drawback in the current marketplace is the lack of a single tool that can fulfill all analytic needs; instead, the market is fragmented into various specialty tools.

Pull vs Push is the key differentiating factor between ETL and Data Ingestion. Historically, ETL tools relied on batch-mode processing, in which jobs triggered by a schedule or an event pulled data from the sources to kick-start the loading process. Since “Big Data” has different characteristics, streaming and real-time processing, where data is pushed as it is generated, become critical in addition to batch mode.
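The pull vs push distinction can be illustrated with a small sketch. This is only an illustration under stated assumptions: BatchSource, StreamingSource and run_batch_job are hypothetical names made up for this example, the "warehouse" and "lake" are plain Python lists, and no real ETL or ingestion tool is being modeled.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class BatchSource:
    """Hypothetical source that buffers records until a job asks for them (pull)."""

    def __init__(self) -> None:
        self._buffer: List[Record] = []

    def write(self, record: Record) -> None:
        self._buffer.append(record)

    def fetch_all(self) -> List[Record]:
        # Hand over everything buffered so far and start a fresh buffer.
        batch, self._buffer = self._buffer, []
        return batch


class StreamingSource:
    """Hypothetical source that forwards each record to subscribers on arrival (push)."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[Record], None]] = []

    def subscribe(self, handler: Callable[[Record], None]) -> None:
        self._handlers.append(handler)

    def write(self, record: Record) -> None:
        for handler in self._handlers:
            handler(record)


def run_batch_job(source: BatchSource, target: List[Record]) -> int:
    """Pull model: when triggered, load whatever has accumulated at the source."""
    batch = source.fetch_all()
    target.extend(batch)
    return len(batch)


warehouse: List[Record] = []  # stand-in for a batch-loaded target
lake: List[Record] = []       # stand-in for a continuously fed target

batch_source = BatchSource()
stream_source = StreamingSource()
stream_source.subscribe(lake.append)

for reading in ({"sensor": "a", "value": 1}, {"sensor": "b", "value": 2}):
    batch_source.write(reading)
    stream_source.write(reading)

# Pushed records are already in the lake; the warehouse stays empty
# until the batch job is triggered.
lake_before_job = len(lake)
warehouse_before_job = len(warehouse)
loaded = run_batch_job(batch_source, warehouse)
```

In the push case the target is current as soon as a record is written, while in the pull case freshness depends on how often the job runs. That trade-off is what makes streaming attractive for high-volume feeds such as IoT data.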

Future Outlook

The convergence of ETL and Data Ingestion tools has started, and features found in Data Ingestion tools are now available within ETL tools. For example, processing JSON data, connecting to web services, and pushing data into a data lake rather than a data warehouse are now out-of-the-box features in many ETL tools. Managing and processing real-time data feeds is also a reality now, in contrast to the earlier batch-only processing framework.
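As a rough sketch of what handling JSON for both kinds of target might involve, the example below parses a small feed once, keeps each record whole for a data lake, and projects it onto fixed columns for a data warehouse. The function name ingest, the column list and the sample feed are assumptions invented for this illustration; they do not come from any particular tool.

```python
import json
from typing import Dict, List, Optional, Tuple

# Hypothetical fixed schema for the warehouse table.
WAREHOUSE_COLUMNS = ("device_id", "temperature")

Row = Tuple[Optional[object], ...]


def ingest(lines: List[str]) -> Tuple[List[Dict], List[Row]]:
    """Parse JSON lines once and prepare them for two different targets."""
    lake_records: List[Dict] = []
    warehouse_rows: List[Row] = []
    for line in lines:
        record = json.loads(line)
        # Data lake: keep the record as-is, whatever fields it carries.
        lake_records.append(record)
        # Data warehouse: project onto the fixed columns; missing fields become None.
        warehouse_rows.append(tuple(record.get(col) for col in WAREHOUSE_COLUMNS))
    return lake_records, warehouse_rows


# Made-up sample feed with records of differing shape.
feed = [
    '{"device_id": "d1", "temperature": 21.5, "tags": ["lab"]}',
    '{"device_id": "d2", "humidity": 40}',
]
lake_records, warehouse_rows = ingest(feed)
```

The lake copy retains fields the warehouse schema does not know about (tags, humidity), while the warehouse rows stay uniform. This mirrors the structured vs format-insensitive contrast described earlier.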

Organizations that are exploring the use of data lakes in conjunction with a data warehouse can leverage a single tool for all ETL and Data Ingestion activities.

Further Reading

Link: Top 15 Data Ingestion tools by Predictive Analytics Today