Every day, companies must store and manage large amounts of information, used for purposes ranging from marketing to billing to capacity planning to legal and compliance. Data volumes are now so large that they are measured in petabytes, exabytes and zettabytes.
Software vendors and open source communities have addressed the data explosion problem with different solutions: improved RDBMS technologies, NoSQL and columnar databases, highly compressed grids and file system archive technologies. There is little doubt that the winning solution has been Hadoop: Hadoop data lakes have spread across enterprises.
The massive adoption of large Hadoop infrastructures has certainly solved the problem of cost-effective storage of large amounts of data. But this is only one part of the problem. Data is not something that merely has to be stored; data is a competitive advantage: it gives insight into what customers like, it can be used to automate processes and create self-adaptive infrastructure, and so on. This means that data has to be stored effectively but, more importantly, it must be easy and fast to access so that analytics can be run on it.
But how can you easily access data across multiple Hadoop infrastructures, merge it with data coming from other databases (e.g. an EDW) and run analytics on it?
Data duplication or consolidation is not a viable strategy: the data is so large that consolidating it is practically impossible. So the only viable strategy for managing data distributed across the enterprise is to federate it.
Hadoop offers mechanisms to federate different instances at the HDFS level (e.g. see http://hortonworks.com/blog/an-introduction-to-hdfs-federation/), but these mechanisms don't work once you start adding products such as HBase, Hive, Impala, etc. Moreover, Hadoop federation does not work for non-Hadoop archives such as EDWs, content archives, RDBMSs, etc.
The solution is an API-based federation layer that:
- presents different archive instances as one
- allows querying data using a simple language (e.g. SQL)
- hides archive heterogeneity (Hadoop data lakes, RDBMS, NoSQL, XML DBs, etc.)
- manages the extraction of large result data sets
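
To make these four properties concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the `FederationLayer` class and the `make_archive` helper are hypothetical names invented for this example, in-memory SQLite databases stand in for the real archives (Hadoop, EDW, RDBMS), and every archive is assumed to expose the same `customers` table. A real federation layer would also need per-archive query translation and cross-archive joins, which this sketch does not attempt.

```python
import sqlite3
from typing import Iterator, List, Sequence, Tuple


class FederationLayer:
    """Hypothetical toy federation layer: one SQL entry point over many archives."""

    def __init__(self) -> None:
        # Each archive is a DB-API style connection. SQLite is only a stand-in
        # here for heterogeneous back ends such as Hadoop, an EDW or an RDBMS.
        self._archives: List[sqlite3.Connection] = []

    def register(self, connection: sqlite3.Connection) -> None:
        """Add an archive; callers never address it directly afterwards."""
        self._archives.append(connection)

    def query(
        self, sql: str, params: Sequence = (), batch_size: int = 1000
    ) -> Iterator[Tuple]:
        """Run the same SQL on every archive and stream the combined rows.

        Rows are fetched in batches and yielded lazily, so a large result
        set is never held in memory all at once.
        """
        for connection in self._archives:
            cursor = connection.execute(sql, params)
            while True:
                batch = cursor.fetchmany(batch_size)
                if not batch:
                    break
                yield from batch


def make_archive(rows: Sequence[Tuple[int, str]]) -> sqlite3.Connection:
    """Build an in-memory archive with an assumed 'customers' table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    connection.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return connection


# Two separate archives are presented to the caller as one.
federation = FederationLayer()
federation.register(make_archive([(1, "Alice"), (2, "Bob")]))
federation.register(make_archive([(3, "Carol")]))

for row in federation.query(
    "SELECT id, name FROM customers WHERE id > ? ORDER BY id", (1,)
):
    print(row)
```

The caller writes one SQL statement and iterates over one result stream, without knowing how many archives sit behind the layer or where each row came from. Note that the `ORDER BY` applies within each archive only; a global ordering across archives would require an extra merge step in the layer.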