How to exploit the full use of Hadoop and Data Lakes – Part 2: How to create a Hadoop-friendly data schema

Introduction

As described in my previous article (How to exploit the full use of Hadoop), Hadoop is an easy and cost-effective platform to store and manage any kind of data (structured and unstructured) in schema-free models. But despite the schema-free nature of Hadoop, it is always critical to define a data storage strategy for loading data into Hadoop. If data loading does not respect Hadoop's internal mechanisms, your data lake can very easily turn into a data swamp.

Creating a Hadoop-friendly data schema means defining the following aspects:

  1. Denormalized data schema. Yes, it is strongly recommended to use a data schema even in Hadoop: its data distribution mechanism is very powerful, but it can be extremely inefficient if not used properly.
  2. Storage file format. Hadoop allows numerous data formats, but some of them are better optimized for its storage and processing.

Step 1 – Denormalize your data

Hadoop is built around a highly distributed file system (HDFS, the Hadoop Distributed File System) and distributes data across its nodes for both storage and processing. To reduce data transit over the network, Hadoop pushes processing down to the storage layer, where the data is physically stored. Each Hadoop node has both processing and storage capabilities. Processing is organized in two phases: Map and Reduce. In the "Map" phase each node works on its locally-stored data, while in the "Reduce" phase all the results are grouped together.

The key assumption of this model is that each node mainly works on its locally-stored data, and the Reduce phase is only used to aggregate and sort the results.
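To make the model concrete, here is a minimal, purely illustrative sketch of the Map/Reduce idea in plain Python (not actual Hadoop code): the two "node" lists, the mapper and the reducer are all hypothetical.

```python
from collections import defaultdict

# Toy illustration of the Map/Reduce model: each "node" maps over its
# locally-stored records, then the Reduce phase groups the partial
# results by key and aggregates them.

node_1_records = [("shop_A", 10.0), ("shop_B", 5.0)]   # data local to node 1
node_2_records = [("shop_A", 7.5), ("shop_C", 2.0)]    # data local to node 2

def map_phase(local_records):
    # Map: each node emits (key, value) pairs from its own data only.
    return [(shop, amount) for shop, amount in local_records]

def reduce_phase(all_mapped_pairs):
    # Reduce: results from every node are grouped by key and aggregated.
    totals = defaultdict(float)
    for shop, amount in all_mapped_pairs:
        totals[shop] += amount
    return dict(totals)

mapped = map_phase(node_1_records) + map_phase(node_2_records)
print(reduce_phase(mapped))  # {'shop_A': 17.5, 'shop_B': 5.0, 'shop_C': 2.0}
```

In a real cluster, the Map phase runs in parallel on the nodes that physically hold the data, and only the (much smaller) mapped results travel over the network for the Reduce phase.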

This assumption is not respected by traditional data models (data marts, 3NF, …), where each information record is stored in "fragments" across several tables ("data normalization"). Only at query time is the full record rebuilt by joining the tables.

In Hadoop, because data is highly distributed, a node performing a join may need data stored on other nodes, and a typical Hadoop join requires moving data across the network at least twice. The key to using Hadoop effectively is therefore to use pre-joined tables ("denormalized tables"): with denormalized tables, rebuilding the full record is just a scan and filter, as shown in the picture, so data transfer over the network is minimized.
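As a rough sketch of the difference, the following PySpark snippet (table names, paths and columns are all hypothetical) contrasts a query that rebuilds the record with a join against the same query on a pre-joined, denormalized table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denormalization-example").getOrCreate()

# Normalized model: the full record must be rebuilt with a join,
# which typically shuffles data across the network.
sales = spark.read.parquet("/data/sales")          # hypothetical paths
customers = spark.read.parquet("/data/customers")
joined = (sales
          .join(customers, on="customer_id", how="left")
          .filter("country = 'IT'"))

# Denormalized model: the customer attributes are already stored in the
# sales record, so the same result is a simple scan + filter, no shuffle.
sales_wide = spark.read.parquet("/data/sales_denormalized")
result = sales_wide.filter("customer_country = 'IT'")
```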

Data schema denormalization can be performed in two steps:

  1. Collapse the dimension tables, and
  2. Flatten the fact, dimension and reference tables into one or a few tables,

as shown in the following picture:

Collapsing dimensions

Collapsing a dimension means that all the attributes belonging to that dimension, but hierarchically distributed across different tables, are moved into a single structure.

The following example shows how a time dimension can be collapsed into a single table.
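As a sketch of the same idea in code, assuming a hypothetical snowflaked time hierarchy split across day, month and year tables, the collapse could be performed once with a PySpark job like the following:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collapse-time-dimension").getOrCreate()

# Hypothetical snowflaked time dimension: day -> month -> year.
day = spark.read.parquet("/data/dim_day")      # day_id, date, month_id
month = spark.read.parquet("/data/dim_month")  # month_id, month_name, year_id
year = spark.read.parquet("/data/dim_year")    # year_id, year_number

# Collapse the hierarchy into a single date dimension table.
dim_date = (day
            .join(month, on="month_id", how="left")
            .join(year, on="year_id", how="left")
            .select("day_id", "date", "month_name", "year_number"))

dim_date.write.mode("overwrite").parquet("/data/dim_date_collapsed")
```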

Flatten the schema

Flattening a schema means creating wide tables ("denormalized tables") that combine all the fact and dimension table columns. This approach guarantees that all the fields of a record are stored together and therefore co-located on the same node by HDFS.

Denormalized tables are therefore composed of one or more fact tables, their dimension tables and all their reference information.

Denormalized table = Fact tables ⨝ Reference tables ⨝ Dimension tables

The following picture shows an example of the denormalization of one fact table and its dimension tables.
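In code, the flattening step is essentially a one-off pre-join performed at load time. The sketch below uses PySpark with hypothetical table names, paths and columns; it is an illustration of the approach, not a prescribed implementation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatten-schema").getOrCreate()

# Hypothetical star schema: one fact table plus two collapsed dimensions.
fact_sales = spark.read.parquet("/data/fact_sales")        # sale_id, date_id, product_id, amount
dim_date = spark.read.parquet("/data/dim_date_collapsed")  # date_id, date, month_name, year_number
dim_product = spark.read.parquet("/data/dim_product")      # product_id, product_name, category

# Denormalized table = fact ⨝ dimensions: pre-join once at load time...
sales_wide = (fact_sales
              .join(dim_date, on="date_id", how="left")
              .join(dim_product, on="product_id", how="left"))

# ...and store it as a single wide table, so queries become scan + filter only.
sales_wide.write.mode("overwrite").parquet("/data/sales_denormalized")
```

The cost of the join is paid once at load time instead of at every query.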

Step 2 – Storage file format

Even though Hadoop allows the use of numerous data formats, some formats are better optimized for Hadoop storage and processing, and using the appropriate file format can provide strong performance improvements. For data warehouse (DWH) and analytic applications, the use of a columnar file format (e.g. Parquet) can be a real game changer in terms of performance and flexibility.

Hadoop columnar file formats provide two main benefits:

  1. Data is stored by column rather than by row.
  2. Data can be arranged in complex and nested data structures.

Columnar file format

In a traditional row-oriented database, data is sequentially organized and stored by row; in a columnar file format, data is sequentially organized and stored by column.

The first advantage of the columnar approach is that a read does not need to scan the whole table, only the columns required by the operation, as shown in the picture:

This drastically optimizes Hadoop workloads, as it reduces the disk I/O needed to feed the Map and Reduce phases. Columnar file formats are particularly well suited to DWH and analytical processing, since most DWH operations (e.g. aggregation, grouping, averaging) are column-based.

The second advantage is that these file formats allow better compression. In general, the compression factor of column-organized data is higher than that of row-organized data, since data tends to be more homogeneous within a column than within a row.
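To illustrate column pruning, the following PySpark sketch (paths and column names are hypothetical) reads only the two columns needed for an aggregation from a Parquet table, so the remaining columns never leave the disk:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("columnar-read").getOrCreate()

# With a columnar format such as Parquet, selecting only the columns you
# need means only those column chunks are read from disk (column pruning).
sales_wide = spark.read.parquet("/data/sales_denormalized")  # hypothetical path

monthly_totals = (sales_wide
                  .select("month_name", "amount")   # only these 2 columns are read
                  .groupBy("month_name")
                  .agg(F.sum("amount").alias("total_amount")))

monthly_totals.show()
```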

Using a specific columnar format such as Parquet provides additional advantages:

  • Parquet stores the full metadata at the end of each file, so Parquet files are self-describing.
  • Parquet stores some column statistics (e.g. min/max values, counts) as metadata, so some operations can be performed by reading only the metadata rather than the whole column.
  • The Parquet file format supports both serialization and splitting (files can be processed in parallel), which is extremely important for optimizing MapReduce or Spark workloads.
  • Parquet uses efficient and extensible encoding schemes (e.g. bit-packing and run-length encoding).
  • Parquet can be read and written using the Avro API and Avro schemas.
  • Parquet provides "predicate pushdown", further reducing disk I/O costs (see the sketch after this list).
  • Parquet is designed to support complex and nested data structures (see the next paragraph).
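As a small illustration of predicate pushdown, the PySpark sketch below (paths and columns are hypothetical) applies a filter that Spark pushes down to the Parquet reader, so row groups whose min/max statistics cannot match are skipped; `explain()` shows the pushed filters in the physical plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("predicate-pushdown").getOrCreate()

sales_wide = spark.read.parquet("/data/sales_denormalized")  # hypothetical path

# The filter is pushed down to the Parquet reader: row groups whose
# min/max statistics cannot contain year_number = 2024 are skipped entirely.
recent_sales = sales_wide.filter(F.col("year_number") == 2024)

# The physical plan shows the PushedFilters applied at the data source.
recent_sales.explain()
```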

Complex and Nested data structures

The Parquet file format allows nested and complex structures to be used as elements of the data schema. This means it is possible to create and use data types such as:

  • Complex structures (Struct and Struct of Structs)
  • Tuples or arrays of complex structures (Array of Structs and Map of Structs)

(For more information on how to create structs in Parquet, see the Parquet documentation.)
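As an illustration, the following PySpark sketch (table names, paths and columns are hypothetical) packs five address columns into a single Struct and collects each customer's orders into an Array of Structs:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-structures").getOrCreate()

# Hypothetical flat customer table with five separate address columns.
customers = spark.read.parquet("/data/customers")

# Pack the address fields into a single Struct column...
customers_nested = (customers
                    .withColumn("address",
                                F.struct("street", "number", "zip_code", "city", "country"))
                    .drop("street", "number", "zip_code", "city", "country"))

# ...and collect each customer's orders as an Array of Structs.
orders = spark.read.parquet("/data/orders")  # customer_id, order_id, amount
orders_nested = (orders
                 .groupBy("customer_id")
                 .agg(F.collect_list(F.struct("order_id", "amount")).alias("orders")))

(customers_nested
 .join(orders_nested, on="customer_id", how="left")
 .write.mode("overwrite").parquet("/data/customers_denormalized"))
```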

Complex and nested data structures provide the following benefits:

  • They facilitate the creation of denormalized tables.
  • They allow more "natural" data objects to be created. For example, a single column can store an "address" object instead of using five separate columns.
  • They allow different data granularities to be integrated in the same table (e.g. to merge two fact tables): the higher-granularity data can be represented as a tuple or a nested table inside a cell of the main table.
  • They take advantage of the columnar data format for all data elements. Indeed, storing the whole object in a single column is more efficient from a query point of view: all the data in the structure is stored adjacently in the same block.
  • They optimize query performance for scans and aggregations, at the expense of storage cost.
  • They reduce the amount of pre-processing required to query data in its "natural" form.
  • They can be queried using standard SQL statements (see the sketch after this list).
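For example, assuming the hypothetical nested table produced in the previous sketch, struct fields and arrays of structs can be queried with ordinary SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-nested").getOrCreate()

# Hypothetical denormalized table with an "address" Struct and an
# "orders" Array of Structs (see the previous sketch).
customers = spark.read.parquet("/data/customers_denormalized")
customers.createOrReplaceTempView("customers")

# Struct fields are addressed with dot notation...
spark.sql("""
    SELECT customer_id, address.city, address.country
    FROM customers
    WHERE address.country = 'IT'
""").show()

# ...and arrays of structs can be unnested with LATERAL VIEW explode.
spark.sql("""
    SELECT customer_id, o.order_id, o.amount
    FROM customers
    LATERAL VIEW explode(orders) exploded AS o
""").show()
```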

Warning: nested data structures can be a very powerful mechanism for speeding up queries but, if abused, they produce the opposite effect. Generally speaking, avoid having more than one or two levels of nested structures.

Conclusions

As seen above, Hadoop moves complexity from the loading phase to the query phase. Such complexity can be removed or limited by using data schemas optimized for Hadoop's storage and processing principles ("Hadoop-friendly" data schemas).

The two main criteria for creating these schemas are:

  1. Denormalize the data schema, so that data transfer across nodes is minimized.
  2. Use a storage file format optimized for Hadoop, such as a columnar format (e.g. Parquet).