Blurring Lines: OLTP vs OLAP – Part 3

In recent years, several technologies have emerged that are slowly closing the gaps and paving the way for a consolidated system. Advancements in both hardware and software have triggered this disruptive shift.

Distributed Computing (MPP/SMP)

Parallel processing breaks a large task down into multiple sub-tasks and executes them in parallel. The results of each sub-task are collated at the end and presented to the requester. The idea is simple from a performance standpoint, but managing the execution requires highly specialized control and operational logic. Since there is a physical limit to how big a single server can grow, BI systems started implementing distributed architectures in a simple way using clusters. Cluster computing made scaling easier, but it was more of a load-balancing approach than pure parallelism.
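The split / execute-in-parallel / collate pattern described above can be illustrated with a minimal sketch in Python. The data set, chunk size, and aggregation function here are illustrative assumptions, not a description of any particular appliance.

```python
# Minimal sketch: break a large task into sub-tasks, run them in parallel,
# collate the partial results. Uses only the standard library.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Sub-task: each worker aggregates its own slice of the data.
    return sum(chunk)

def parallel_sum(values, workers=4):
    # Split the large task into roughly equal sub-tasks.
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Execute sub-tasks in parallel, then collate the partial results.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, chunks)
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))
```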

There are several architectures available in the market today, such as Massively Parallel Processing (MPP), Symmetric Multi Processing (SMP), and the Hadoop and Spark frameworks. Each architecture has its own merits, and the appliances offered by vendors today have adopted a variant of one of these frameworks. The common thread among them is the ability to perform large and complex tasks over a large network of nodes, built from specialized as well as off-the-shelf hardware.

In recent years, multiple cores per processor and multiple processors per board have become commonplace. This packs a punch in terms of computing power and shrinks multiple systems into a single node that can easily handle multitasking workloads. These hardware-level improvements, together with the software that implements the control logic, now work synchronously in "in-memory" mode.

Data Virtualization

A quick introduction to how data virtualization works can be found here.

Two types of virtualization concepts are being implemented today. The first methodology mimics the functionality of ETL tools and physically captures the data: a background process regularly fetches data from the different sources, applies business logic, if any, and updates the target data store. This approach eliminates the need for a separate ETL tool and integrates that capability into the main appliance.
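A minimal sketch of this first approach is shown below: a background-style refresh job pulls rows from a source system, applies optional business logic, and writes the result into a target store. The source and target tables, the currency conversion rule, and the use of in-memory SQLite are hypothetical stand-ins; a real appliance would schedule this continuously against actual source systems.

```python
# Sketch of an ETL-style refresh: fetch from the source, transform, load the target.
import sqlite3

def refresh_target(source_conn, target_conn):
    rows = source_conn.execute("SELECT id, amount, currency FROM orders").fetchall()
    # Business logic, if any: normalize all amounts to a single currency (assumed rate).
    transformed = [(r[0], r[1] * (1.1 if r[2] == "EUR" else 1.0)) for r in rows]
    target_conn.execute("CREATE TABLE IF NOT EXISTS orders_usd (id INTEGER, amount_usd REAL)")
    target_conn.execute("DELETE FROM orders_usd")
    target_conn.executemany("INSERT INTO orders_usd VALUES (?, ?)", transformed)
    target_conn.commit()

# Hypothetical source and target stores for illustration.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, currency TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 100.0, "USD"), (2, 250.0, "EUR")])
target = sqlite3.connect(":memory:")
refresh_target(source, target)
print(target.execute("SELECT * FROM orders_usd").fetchall())
```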

The second and more powerful method performs all actions in real time: data from physically separate systems is logically modeled as a single unit, and any query against the logical model triggers separate sub-queries that fetch and consolidate the data on the fly for the user. This approach is preferred when the source data changes frequently and real-time access is required.
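A minimal sketch of this federated approach, under the assumption that two in-memory SQLite databases stand in for physically separate CRM and billing systems, might look like this. The "logical model" is the function that fans the request out as sub-queries and consolidates the results at query time, so no copy of the data is persisted.

```python
# Sketch of query federation: sub-queries against separate systems, joined at request time.
import sqlite3

crm = sqlite3.connect(":memory:")            # stand-in for a separate CRM system
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")

billing = sqlite3.connect(":memory:")        # stand-in for a separate billing system
billing.execute("CREATE TABLE invoices (customer_id INTEGER, total REAL)")
billing.execute("INSERT INTO invoices VALUES (1, 4200.0)")

def customer_revenue():
    # Sub-query 1: fetched live from the CRM system.
    names = dict(crm.execute("SELECT id, name FROM customers"))
    # Sub-query 2: fetched live from the billing system.
    totals = billing.execute(
        "SELECT customer_id, SUM(total) FROM invoices GROUP BY customer_id")
    # Consolidation happens in the virtualization layer, not in a target store.
    return [(names[cid], total) for cid, total in totals]

print(customer_revenue())
```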

Depending on the task at hand, appliances today even provide the option to choose between the two approaches for data availability.

Epilogue

The four triggers, namely In-Memory Computing, Row vs Column Store, Distributed Computing and Data Virtualization, have an impact beyond the Business Intelligence space. Vendors in the market have added support for open source technologies alongside their proprietary offerings. The market potential is huge because adoption is still in its early stages. Self-service BI and advanced visualization tools are finding good traction with business analysts, who are inclined towards ad-hoc analysis over large data volumes. It is estimated that within the next decade almost 70% of users in an organization will access some form of BI or analytical tool as part of their daily routine.

The only barrier is cost, both in terms of hardware and skilled manpower. The hardware is expensive to procure, and project implementation costs are high due to the shortage of trained people in this arena.

NOTE: SAP HANA is a good example of this disruption, and a similar system from the house of Oracle is the Exalytics In-Memory Machine. IBM and HP have their own product lines in this space as well.