How to extend the life of your Data Warehouse using Hadoop and Cloud

Hadoop and the Cloud are tremendous opportunities for companies to transform their existing Data Warehouses (DWH) into big data analytics infrastructures. But to fully realize this promise, companies have to fully exploit Hadoop's capabilities: from pure data lake to data crunching and manipulation.

The traditional Enterprise Data Warehouse (EDW) model

It is well known that, over the last decades, companies have been building Enterprise Data Warehouses (EDW) to analyze business activities and to produce the reports and dashboards that help decision makers assess and improve the performance of their business. This traditional environment is mostly based on a common schema: data from multiple operational systems are loaded into the Data Warehouse server, where they are cleaned, transformed and integrated by ETL (Extract, Transform and Load) tools. Afterwards, they are exploited by BI and visualization tools. The overall schema is shown in the following picture:

Even if EDWs have been extremely powerful and successful environments for more than three decades, they have significant limitations:

  • The DWH engines are generally based on relational databases (RDBMS). Such databases have limited horizontal scalability, lack storage optimizations (e.g. both data and indexes have to be stored, data reduction as in columnar DBs is not feasible …) and generally don’t have integrated statistical libraries to manage complex analytical models.
  • Most of the DWH archive volume (80-90%) is occupied by raw data (staging tables); only a small portion of the DWH is actually used to manage analytic data. This makes DWHs extremely expensive as the data volume grows.
  • As ETL tools use their own datastages, separate from the DWH ones, they require a lot of additional hardware resources. Moreover, as ETL-based transformations require data movement between the ETL datastages and the DWH, the network introduces high latency into the whole system.
  • Collected data must be structured to be loaded into the DWH. This means that semi-structured and unstructured data must be heavily manipulated before they can be loaded into the system.

Recently, the explosion of data traffic, the proliferation of technologies producing data and the need for more complex and sophisticated data models have exposed all these limits, and put in doubt whether the DWH can be further evolved to support the big data era.

The big data era: the advent of Hadoop as the Enterprise data store

Despite the technical limits of existing DWHs, most Enterprises haven’t replaced them. Companies have preferred strategies focused on preserving the big investment in the DWH infrastructure. Existing EDWs have been “flanked” by large Hadoop clusters, storing the huge volume of data collected every day by worldwide companies. The corporate DWH infrastructure has become a hybrid infrastructure, as the picture below shows: traditional data are managed in the traditional way, while new data are stored in Hadoop data lakes connected to the DWH engines.

Hadoop clusters are used as “data staging” areas for big data.
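As a small illustration of this staging pattern, the following sketch (assuming PySpark, and hypothetical HDFS paths and field names that are not taken from any specific setup) lands raw, semi-structured events in the data lake untouched, keeping the full-fidelity copy out of the expensive DWH storage:

```python
# Minimal sketch (assumptions: PySpark available, hypothetical paths, and a
# feed that carries an event_date field): land raw JSON events in the Hadoop
# data lake as partitioned Parquet; only refined data will reach the DWH later.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-landing").getOrCreate()

# Read the raw, semi-structured feed as-is, with no upfront schema enforcement.
raw = spark.read.json("hdfs:///landing/incoming/events/*.json")

# Store it unchanged in the staging zone, partitioned by event date.
raw.write.mode("append").partitionBy("event_date") \
   .parquet("hdfs:///datalake/staging/events")
```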

This strategy addresses two main problems: the reduction of archiving costs and the simplification of the acquisition of the new data formats produced by the internet and the IoT world. However, the resulting architecture is still far from being the Enterprise analytics infrastructure needed by future, fully data-driven companies. Indeed, most IT executives still complain that they are not able to fully leverage the huge quantity of data stored in Hadoop clusters.

The main reason why companies are not able to leverage data in Hadoop clusters is that this model does not address the problems related to ETL, nor the need for easy mechanisms to transform unstructured data into structured data.

Although ETL is not a real analytic function in itself, ETLs are a fundamental function in the analytics world: they define the sequence in which data have to be manipulated. In other words, ETL “populates” the data model with transformed data. The following diagram illustrates the elementary ETL process:
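For reference, here is a minimal, self-contained sketch of that elementary process in Python; table names, file names and the currency conversion are made up for illustration, and a real ETL tool would add scheduling, lineage and error handling on top of this:

```python
# Minimal extract-transform-load sketch (hypothetical files and tables).
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a staging file.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    # Transform: cleaning, type casting, derivation of new fields.
    return {
        "customer_id": int(row["customer_id"]),
        "amount_eur": round(float(row["amount"]) * 0.92, 2),  # hypothetical rate
        "order_day": row["order_ts"][:10],
    }

def load(rows, dwh):
    # Load: populate the target fact table in the DWH.
    dwh.executemany(
        "INSERT INTO sales_fact (customer_id, amount_eur, order_day) VALUES (?, ?, ?)",
        [(r["customer_id"], r["amount_eur"], r["order_day"]) for r in rows],
    )

dwh = sqlite3.connect("dwh.db")  # stand-in for the real DWH engine
dwh.execute("CREATE TABLE IF NOT EXISTS sales_fact (customer_id, amount_eur, order_day)")
load((transform(r) for r in extract("staging/orders.csv")), dwh)
dwh.commit()
```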

Many ETL limitations come from bad practices (monolithic jobs, hard-coded transformations, an output-oriented approach …), but ETL tools also have some infrastructural limits that become critical with large volumes of data. These limits are:

  • Data transformations are done in the ETL datastages, so data are constantly moved back and forth between the ETL and the DWH. The impact is not only the need for more hardware to run the ETL, but, even worse, also an increase in processing time due to network bandwidth limitations and latency.
  • ETL data-transformation capabilities are mainly SQL-based. This means limited capabilities to:
    • extract structured data from unstructured text-based files, such as application and system log files, web logs, system traps, etc. (see the parsing sketch after this list)
    • create analytical models based on statistical and analytic functions such as recommendation models, market basket analysis, clustering, etc.
    • crunch data
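To illustrate the first point, the sketch below (a minimal example, with a hypothetical web-log format) shows how a few lines of procedural code turn a free-text log line into a structured record, something that is painful to express in pure SQL:

```python
# Minimal sketch (hypothetical log layout): the kind of free-text parsing that
# is awkward in SQL-based ETL but trivial in procedural code running on Hadoop.
import re

LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse(line):
    """Turn one raw web-log line into a structured record (or None)."""
    m = LINE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec

sample = '10.0.0.1 - - [01/Jun/2024:12:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120'
print(parse(sample))
```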

Exploiting the full potential of Hadoop

To address these issues, customers don’t need to introduce new tools; they only need to learn how to fully exploit Hadoop’s features.

The Hadoop Distributed File System (HDFS) has been designed not only to reliably store very large data sets, but also to serve data to user applications as fast as possible. To achieve a more efficient data flow, Hadoop avoids data transit over the network whenever possible: instead of moving the data from the storage layer to the application layer (and thus consuming bandwidth), processes are pushed down to the storage layer, closer to where the data are physically stored. Processing tasks are performed on the physical nodes where the data reside. This Map-Reduce mechanism significantly improves the network I/O patterns, keeps most of the I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth: each MAP process works on locally stored data, and its results are then grouped by the REDUCE process.
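To make the mechanism concrete, here is a schematic Hadoop Streaming example in Python (the record layout and field positions are illustrative assumptions, not part of the original architecture): the mapper runs on the nodes holding the data blocks and emits key/value pairs, and the reducer aggregates the grouped output.

```python
# mapper.py -- schematic Hadoop Streaming mapper (illustrative sketch).
# Each map task reads the HDFS blocks stored on its own node and emits
# key/value pairs; raw data does not cross the network at this stage.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 8:            # assumes a web-log-like record layout
        status = fields[8]         # hypothetical position of the HTTP status code
        print(f"{status}\t1")
```

```python
# reducer.py -- schematic Hadoop Streaming reducer: it receives the mapper
# output already grouped and sorted by key, and sums the emitted counts.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

The two scripts would be submitted through the Hadoop Streaming jar, which schedules the map tasks close to the data and shuffles and sorts the intermediate pairs before the reduce phase.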

Furthermore, since Hadoop has been designed to efficiently manage unstructured data, Map-Reduce offers extremely flexible and powerful capabilities for text processing and manipulation.

These features show how Hadoop can also “flank” ETL tools. Map-Reduce, or Hadoop’s higher-level languages and frameworks such as Pig, Cascading, Crunch or even Spark, can be used to create data elaboration pipelines which can replace ETL jobs (a sketch follows the list below). These languages have additional advantages over the traditional SQL-based approach used by most ETL tools:

  • Hadoop languages are fully procedural and easily fit the pipeline paradigm. For example, Pig Latin scripts can describe not only linear pipelines but also directed acyclic graphs.
  • Hadoop languages allow user code (e.g. Java, Python …) to be added at any point in the pipeline.
  • In Hadoop languages, intermediate data produced during transformations can be saved in Hadoop itself. This provides two main advantages:
    • It avoids the creation of monolithic transformations (ETL developers tend to develop big jobs to avoid data transfers to/from the DWH and the related high latencies).
    • It simplifies the use of the output of one job as the input of another job, making the transformation processes more atomic and easier to maintain.
  • Hadoop languages are designed to easily import and export multiple file formats, simplifying the creation of multiple pipelines.
  • Powerful text manipulation, intermediate data storage and the elimination of big latencies make Hadoop languages a powerful tool for data crunching.
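As a deliberately simplified illustration of this pipeline style, the following PySpark sketch (with hypothetical paths and fields) runs a first stage that crunches raw text and persists an intermediate dataset in Hadoop, and a second, separate stage that consumes that intermediate data to produce an export for the DWH:

```python
# Two-stage pipeline sketch in PySpark (hypothetical paths and log layout).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload").getOrCreate()

# Stage 1: text crunching -- turn raw log lines into structured columns.
raw = spark.read.text("hdfs:///datalake/staging/weblogs/")
parsed = raw.select(
    F.regexp_extract("value", r'^(\S+)', 1).alias("ip"),
    F.regexp_extract("value", r'"\S+ (\S+) ', 1).alias("path"),
    F.regexp_extract("value", r'" (\d{3}) ', 1).cast("int").alias("status"),
)
# Intermediate data stays in Hadoop: cheap storage, reusable by other jobs.
parsed.write.mode("overwrite").parquet("hdfs:///datalake/intermediate/weblogs_parsed")

# Stage 2: a separate, atomic job consumes the intermediate output.
hits = spark.read.parquet("hdfs:///datalake/intermediate/weblogs_parsed")
error_counts = hits.filter(F.col("status") >= 500).groupBy("path").count()
error_counts.write.mode("overwrite").csv("hdfs:///datalake/export/for_dwh/error_counts")
```

Because each stage reads from and writes to Hadoop, the jobs stay small and composable, and the output of one can feed several downstream pipelines.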

The following picture shows what the new hybrid DWH architecture would look like: data in the DWH staging area is exported into Hadoop, where it is processed and transformed before being loaded back into the DWH system to create the Data Warehouse core tables (data normalization and/or dimensions) and the Data Marts. ETL will still be used to create and manage the core DWH transformations and Data Marts, and for the existing jobs that don’t need parallel processing or heavy data crunching, as the replacement of these jobs would not be cost-effective.
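A hedged sketch of this round trip, again in PySpark with placeholder connection details and table names (the export and re-import could equally be handled by a dedicated tool such as Sqoop), might look like this:

```python
# Round-trip sketch (hypothetical JDBC URL, credentials and tables): pull a
# staging table out of the DWH, transform it on the Hadoop cluster, and load
# only the reduced, modelled result back into the DWH core schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dwh-offload").getOrCreate()

jdbc_url = "jdbc:postgresql://dwh-host:5432/edw"   # placeholder for the real DWH
props = {"user": "etl_user", "password": "***", "driver": "org.postgresql.Driver"}

# Export: read the heavy staging table from the DWH into the cluster.
staging = spark.read.jdbc(jdbc_url, "staging.orders_raw", properties=props)

# Transform in Hadoop/Spark instead of the ETL datastage or the DWH engine.
core = (staging
        .dropDuplicates(["order_id"])
        .withColumn("order_day", F.to_date("order_ts"))
        .groupBy("customer_id", "order_day")
        .agg(F.sum("amount").alias("daily_amount")))

# Load back only the compact, modelled result into the DWH core area.
core.write.jdbc(jdbc_url, "core.daily_sales", mode="append", properties=props)
```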

Some may argue that Map-Reduce programming is much more complicated than the SQL-based approach. That may be true, but the advantages of using Hadoop to manage data crunching and transformation far outweigh those of traditional ETL tools.

Summary

To extend the life (and the investments) of existing DWHs, Enterprises can introduce Hadoop into their existing infrastructures, not only to store data but also as a data crunching and transformation engine.

To go beyond the limits of traditional DWHs, the role of Hadoop must be expanded beyond storage: Hadoop should also be used to manage data crunching and most of the ETL jobs.