Data (Model) Normalization


Normalization is a technique used to manipulate either the data or the data model, depending on the type of system and the analysis needs. There are subtle differences in how normalization is applied across Data Processing, Data Warehouse and Data Analytics requirements.

(Image credit: http://tips4ufromsony.blogspot.in/)
ER Model

In the classic Relational Database (OLTP) scenario, where transaction processing systems are designed, normalization is implemented in the data model. Boyce-Codd Normal Form (BCNF) is the gold standard for modeling data with the Entity-Relationship method, and it has stood the test of time.

Normalization principles help enforce the ACID properties, which are critical for the overall integrity of the system. These principles, popularly known as 1NF, 2NF, 3NF and so on, are a sequence of steps that gradually break the data set into manageable table structures. The end result is a data model that adheres to all criteria and enables Atomicity, Consistency, Isolation and Durability for all transactions.
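
To make the idea concrete, below is a minimal sketch (with made-up column names and data) of how a single flat, repetitive table can be decomposed into separate tables, in the spirit of the normal forms described above:

```python
# Minimal sketch (illustrative column names): splitting one wide, repetitive
# table into separate customer and order tables, mirroring the way
# 1NF/2NF/3NF decomposition removes redundancy from a single flat structure.
import pandas as pd

flat = pd.DataFrame({
    "order_id":      [101, 102, 103],
    "customer_id":   [1, 1, 2],
    "customer_name": ["Asha", "Asha", "Ravi"],   # repeated per order -> redundancy
    "customer_city": ["Pune", "Pune", "Delhi"],
    "amount":        [250.0, 90.5, 410.0],
})

# Customer attributes depend only on customer_id, so they move to their own table.
customers = flat[["customer_id", "customer_name", "customer_city"]].drop_duplicates()

# Orders keep only the foreign key to customers plus order-level attributes.
orders = flat[["order_id", "customer_id", "amount"]]

print(customers)
print(orders)
```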

Dimension Model

In the Data Warehousing / Business Intelligence (OLAP) context, normalization actually refers to de-normalization of the data model. Designing a de-normalized model is simpler than designing a normalized relational database. Although many design principles carry over from Entity-Relationship modeling (e.g. definition of keys, constraints, relationships), their contextual application is slightly different.

The modeling activity is simpler than for transaction processing systems, but the complexity lies in maintaining a consistent view of Entities (i.e. Dimensions) and Activities (i.e. Facts). Entities (strong or weak) and their attributes are typically integrated into dimensions, and all transactions are combined into facts. The result is a data model popularly known as a star schema, which forms the base for storing all data. ETL tools play a major role in transforming and populating the data correctly because of the de-normalized structure of this model. One key point to note is that ETL does not alter the actual fact data in any manner, and reconciliation between the original source and the data warehouse should always be a 100% match.
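
As an illustration, here is a minimal sketch (with hypothetical source data and column names) of shaping transactions into a single product dimension and a sales fact table, with the measures carried through untouched so that source and warehouse totals reconcile:

```python
# Minimal sketch (hypothetical column names): shaping source transactions into a
# star schema with one product dimension and one sales fact table.
import pandas as pd

source = pd.DataFrame({
    "txn_id":   [1, 2, 3],
    "product":  ["Pen", "Pen", "Notebook"],
    "category": ["Stationery", "Stationery", "Stationery"],
    "qty":      [10, 5, 2],
    "amount":   [50.0, 25.0, 80.0],
})

# Dimension: distinct descriptive attributes with a surrogate key.
dim_product = (source[["product", "category"]]
               .drop_duplicates()
               .reset_index(drop=True))
dim_product["product_key"] = dim_product.index + 1

# Fact: measures plus a foreign key to the dimension; measures are not altered.
fact_sales = (source.merge(dim_product, on=["product", "category"])
                    [["txn_id", "product_key", "qty", "amount"]])

# Reconciliation check: warehouse totals must match the source exactly.
assert fact_sales["amount"].sum() == source["amount"].sum()
```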

Analytics Model

From the Analytics viewpoint, normalization techniques are applied to the actual data rather than to the model. A “model” here refers to the algorithm/technique (e.g. Regression, Classification, Machine Learning, etc.) employed for performing the data analysis. The primary focus of normalizing data is to improve the model’s accuracy. Data normalization is a data engineering activity, also referred to as Feature Selection and Engineering, which is typically performed at the tail end of the data pipeline.

As the old saying goes, “Garbage In, Garbage Out”: a model’s accuracy squarely depends on two key factors in arriving at the right sample set. First, Feature Selection identifies the right variables; second, Data Quality ensures those variables contain qualified data for the analysis. During the data normalization process, each variable is profiled for completeness and transformed based on rules. The best rule to apply varies from case to case and is chosen based on the data and the model.
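
As a simple illustration of the profiling step, the sketch below (using hypothetical variables) summarises each variable’s completeness and range before any normalization rule is chosen:

```python
# Minimal sketch (hypothetical data): profiling each variable for completeness
# and range before deciding which normalization rule to apply.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [30000, np.nan, 52000, np.nan, 61000],
})

profile = pd.DataFrame({
    "non_null":    df.notna().sum(),
    "missing_pct": df.isna().mean() * 100,
    "min":         df.min(),
    "max":         df.max(),
})
print(profile)
```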

Some popular types of data normalization are listed below; a minimal sketch of these techniques follows the list. The model uses the normalized variable in place of the original variable, and the normalization rules may be revised based on the model’s output and accuracy.

  • Min-Max: A variable’s minimum and maximum values are taken as the baselines of 0 and 1. All other values are scaled into this range based on their distance from the minimum.
  • Outlier Elimination: Data points that cause huge swings in the model output are flagged to be ignored. A standard IQR calculation helps identify outliers easily.
  • Standardization: Calculated from the Mean and Standard Deviation, showing the distance between the mean and the variable value in standard deviations (a z-score).
  • Logarithmic Scaling: Applying a log() function to scale large values into a manageable range.
  • Extrapolation and Interpolation: Deriving new values to fill gaps, or creating a new feature based on other dependent variables.
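
The sketch below illustrates each of these techniques on a small, made-up series using pandas and NumPy; the values and variable names are purely illustrative:

```python
# Minimal sketch (hypothetical values) of the normalization techniques above.
import numpy as np
import pandas as pd

x = pd.Series([2.0, 5.0, 7.0, 9.0, 100.0])   # 100.0 acts as an outlier

# Min-Max: rescale values into the 0..1 range.
min_max = (x - x.min()) / (x.max() - x.min())

# Outlier elimination: flag points outside 1.5 * IQR of the quartiles.
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
is_outlier = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Standardization: distance from the mean in standard deviations (z-score).
z_score = (x - x.mean()) / x.std()

# Logarithmic scaling: compress large values (log1p also handles zeros safely).
log_scaled = np.log1p(x)

# Interpolation: fill gaps in a series from neighbouring values.
gappy = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = gappy.interpolate()
```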