How to exploit the full potential of Hadoop and Data Lakes – Part 1: Overview

In the last decade, the adoption of the Hadoop Distributed File System (HDFS) has grown at a relentless pace. Today, we can safely say that almost every IT department in the world is using Hadoop in one way or another (in the cloud, on premises, or sometimes both).

The great success of Hadoop is due to three main aspects:

  1. Low-cost storage – HDFS runs on standard, commodity hardware
  2. Highly distributed – Hadoop scales up to thousands of nodes and dozens of petabytes of storage
  3. Easy data ingestion – in Hadoop, data can be loaded as-is, without any predefined data schema (Schema-on-Read).

In a world where data volumes were exploding, these aspects were a panacea for IT departments. A new term was even coined to describe these new systems: Data Lakes.

But over time, companies observed that querying and extracting data from the lakes was more complicated and costly than expected. As the data grew, fewer and fewer people were able to know what data was in the lake and where it resided. In practice, data was neither accessible nor usable by users. Data Lakes were turning into Data Swamps. What happened?

Data lakes have usability issues

Paradoxically, the problem stems from Hadoop's strengths: if not properly managed, loading data without a schema onto a highly distributed file system can easily create huge usability issues. Let's see why.

Loading data into HDFS is extremely easy: it is just a copy of your files. You don't need to create any data schema; the schema is applied only at query time (read) by the query itself. This approach is called "Schema-on-Read". By contrast, in traditional databases or data warehouses (DWHs), data is loaded into a predefined data schema (tables) that must be created before the loading phase. That is the "Schema-on-Write" model.
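As a minimal illustration of schema-on-read, consider the following PySpark sketch (the HDFS path and columns are hypothetical assumptions, not part of any specific setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

    # Schema-on-Read: the raw CSV files were simply copied into HDFS as-is;
    # their structure is interpreted only now, at read time, by the query engine.
    orders = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")      # schema derived while reading
              .csv("hdfs:///data/raw/orders/"))   # hypothetical HDFS path

    orders.printSchema()                          # the schema exists only within this job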

In a traditional database, when data is loaded the system automatically uses the data schema to create or update metadata such as the data catalog and data dictionary. In Hadoop there is no schema, so metadata cannot be created automatically by the system itself. It must be created by a parallel process in which the data schema and the data meaning are provided. If this process is missing, there is no way to know where the data is and what it means.
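One common way to provide that metadata is to register the files as a table in the Hive Metastore. A minimal sketch, assuming a Spark session with Hive support and hypothetical table, column, and path names:

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark read and write the Hive Metastore,
    # which acts as the data catalog for files sitting in HDFS.
    spark = (SparkSession.builder
             .appName("metastore-demo")
             .enableHiveSupport()
             .getOrCreate())

    # Register the files as an external table: the data stays where it is,
    # but its schema and location are now recorded in the Metastore.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS orders (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      DOUBLE,
            order_ts    TIMESTAMP
        )
        STORED AS PARQUET
        LOCATION 'hdfs:///data/curated/orders/'
    """)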

The absence of a data schema creates another problem: to extract data you cannot use a schema-based language such as SQL; instead, you need a Hadoop-specific programming framework such as Pig or MapReduce.

To address these issues, the Hadoop community created, early on, a vast set of SQL engines and tools (e.g. Hive, Impala, Drill, Presto) that provide RDBMS capabilities on top of HDFS. With these tools it is possible to work with Hadoop as if it were a traditional database or data warehouse (schema-on-write).
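Once a table is registered in the Metastore, any of these engines can query it with plain SQL. A sketch using Spark SQL against the hypothetical orders table defined above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-on-hadoop-demo").enableHiveSupport().getOrCreate()

    # Any SQL-on-Hadoop engine (Hive, Impala, Presto, Spark SQL, ...) resolves
    # the table name through the Metastore and runs plain SQL on the HDFS files.
    monthly_revenue = spark.sql("""
        SELECT date_format(order_ts, 'yyyy-MM') AS month,
               SUM(amount)                      AS revenue
        FROM orders
        GROUP BY date_format(order_ts, 'yyyy-MM')
        ORDER BY month
    """)
    monthly_revenue.show()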

RDBMS capabilities solved Hadoop's usability issues, but at a price: very poor performance. The RDBMS approach does not perform well because of the distributed nature of Hadoop systems.

When users load their data, HDFS automatically splits it into large blocks and distributes them across the lake's nodes. Data management (replication, distribution, backups, ...) is handled automatically by the system.

To reduce data transit over the network, Hadoop pushes computation down to the storage layer, where the data is physically stored. Each Hadoop node has both compute and storage capability. Processing is organized in two phases: Map and Reduce. In the "Map" phase each node works on its locally stored data, while in the "Reduce" phase the partial results are grouped together.
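The classic word count illustrates the two phases. A minimal PySpark sketch (the input and output paths are hypothetical):

    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mapreduce-demo").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/raw/logs/")       # hypothetical input path

    # "Map" phase: each node processes its locally stored blocks.
    pairs = (lines.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1)))

    # "Reduce" phase: partial counts are aggregated by key across the cluster.
    counts = pairs.reduceByKey(add)

    counts.saveAsTextFile("hdfs:///data/out/word_counts")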

The MapReduce mechanism significantly improves network I/O patterns, keeps most of the I/O on local disks or within the same rack, and provides very high aggregate read/write bandwidth. But the key assumption of this model is that each node works mainly on its locally stored data. Data transfer across nodes must be avoided as much as possible, and MapReduce programs must be optimized so that Map activities do not require copying data from one node to another.

This assumption is not respected by traditional data models (data marts, 3NF, ...), where each information record is "fragmented" across several tables (data normalization). The record is "rebuilt" at query time by joining the tables. Fast joins are mandatory for an efficient RDBMS, and for this reason RDBMS tables are physically co-located. In Hadoop, data is highly distributed: to perform a join, a node may need data stored on other nodes. Copying data from one node to another means transferring it over the network, which slows down system performance and reduces HDFS parallelism.
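When a join cannot be avoided, one common mitigation (not covered further in this article) is to broadcast the small side of the join, so that only the small table travels over the network. A PySpark sketch with hypothetical table names:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (SparkSession.builder
             .appName("broadcast-join-demo")
             .enableHiveSupport()
             .getOrCreate())

    orders    = spark.table("orders")       # large fact table, spread across many nodes
    customers = spark.table("customers")    # small dimension table (hypothetical)

    # Broadcasting ships the small table to every node once, so the large table
    # never has to be shuffled across the network to perform the join.
    enriched = orders.join(broadcast(customers), on="customer_id", how="left")
    enriched.show(10)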

How to improve Data Lake usability

As seen previously, Hadoop moves complexity from the loading phase to the query phase, and this reduces its usability when extracting information. SQL tools can mitigate this problem, but to be efficient they must rely on new data schema paradigms that are "respectful" of Hadoop's internal mechanisms. In other words, they must use Hadoop-friendly data schemas.

The key aspects of a Hadoop-friendly data schema are (a combined sketch follows the list):

  1. Avoid highly normalized schemas; create data schemas with a few large denormalized tables. This approach helps ensure that related data is co-located on the same nodes by HDFS.
  2. Use a columnar data format (e.g. Parquet) instead of a row-based one. Columnar formats are better suited for DWH workloads (aggregations, grouping, and averages are all column-based operations) and provide additional benefits such as compression, complex nested data structures, and data serialization.
  3. Use complex and nested data structures. This makes it possible to store all data as a set of discrete column elements in a single table. Moreover, it eliminates the need for large-scale joins, facilitates the creation of denormalized tables, and allows different data granularities to be integrated in the same table.
  4. Never replace records, always append. All tables in Hadoop should be treated as a sort of Slowly Changing Dimension (SCD) type 2 table.
  5. Don't forget metadata. Create a process that updates the Metastore every time data is added, deleted, or updated in Hadoop.