As a parallel concept to Schema On-Read and Schema On-Write, one has to first understand difference between structured and un-structured data. These terms became a part of common parlance after big data tools (Hadoop, Map Reduce etc) exploded in the scene. Fundamentally, we are trying to differentiate two diametrically opposite methodologies of storing, manipulating and analyzing large volumes of data.
In the evolution of data storage methods used by applications, one can count three distinct phases. The concept of schema was introduced by Relational Database Management Systems (RDBMS). The need for a clearly defined schema along with normalization of data stemmed from transaction processing applications. RDBMS systems ensured that transactions were ACID (Atomicity, Consistency, Isolation, Durability) compliant, which enabled very fast adoption by commercial organizations. (There are other database architectures such as Object Databases, Hierarchy Databases etc, but none of the architectures have such a widespread adoption akin to Relational Databases)
After collecting vast amounts of data, organization realized that the traditional transaction processing systems had several bottlenecks when it came to analytical needs. A new field called Data Warehousing & Business Intelligence was born that addressed such concerns by building Online Analytical Processing (OLAP) systems using Relational Databases. Key thing to note here is that underlying data management architecture remained the same.
Schema On-Write was the standard practice where the data warehouse model is defined well in advance. ETL tools performed bulk of work pertaining to fetching, cleaning, applying business rules and ensured that data is written to tables exactly as per predefined structure. Adopting a schema on write approach ensured that all data wrangling is done well in advance and final data that is available for analysts are clean and formatted. Essentially Schema On-Write concept enforced structure on incoming data into the warehouse even if the source is providing unstructured or semi-structured data.
Schema On-Read methodology basically eliminates ETL layer altogether and keeps the data from source in original structure. Basically entire data is dumped in the data store, where the system manages storing data across entire landscape. All Big Data and NoSQL tools follow this approach and just split the files across all processing nodes. During information retrieval, two elements are required i.e. Schema Definition & Processing Logic. When a user wants to analyze the data, he/she creates a schema that is fitted on the data at run-time and processing logic performs the data manipulation part with reference to defined.
Each approach targets different types of analytical scenarios. In today’s context, DW/BI implementations feature both methodologies and both types systems co-exist.