Why do you need a semantic layer for your data lakes?

Data lakes are a tremendous opportunity for companies to transform their existing data infrastructure in support of digital transformation. But to fully deliver on this promise, companies must make their data fully available and usable to the business without IT intermediation.

The data lake problem: making data “usable” by a large number of users

One of the key aspects of digital transformation is the massive production and collection of customer, product, operational and even external data.

Data lakes (in the cloud, on premises or hybrid) make it possible to collect, store and manage all this data in an efficient and cost-effective way. Data lakes are:

  • Low-cost repository systems
  • Highly scalable (up to petabytes)
  • Easy to ingest data into

In a digital world, where data volumes are literally exploding, these aspects are a panacea for IT departments. But data lakes are just collection areas for files. As the data grows, knowing what the data is and where it is becomes increasingly difficult.

So the key problem in using data lakes is how to make data fully available and usable by a large number of users.

To better understand the problem, let us use the library metaphor. In a library, the information (the data) is stored in books, and the information about the books is stored in the library card catalog (the metadata). A user looking for books whose title contains the keyword “Paris” just needs to search the card catalog to get the list of all matching books.

The library card catalog mechanism is extremely powerful but has some limits:

  1. When the library has a lot of books, the search can return a lot of results.
  2. Searching for “Paris” can return books about Paris in France, about “Paris Hilton”, or about Paris in Texas.

The same problem occurs with corporate data. For example, the term “Revenue” is used in several departments (multiple results) and has a different meaning depending on the department (ambiguity). Here is a real example: in one company, both the Finance and the Marketing departments have defined the metric “US loyalty member revenues”. Although the metric appears to be the same for both organizations, it is not. Marketing counts as US loyalty member revenue all revenue generated by US loyalty members, while Finance adds the condition that the revenue must be generated in the US. The difference (revenue generated by US loyalty members outside the US) is small but significant. The term “US loyalty member revenues” has a different meaning for the two organizations!

Data meaning is contextual! This implies that business users have to develop some sort of naming convention or syntax that resolves these ambiguities.
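
To make the difference concrete, here is a minimal sketch in Python. The Sale record, its field names and the sample amounts are all invented for illustration; only the two definitions of the metric come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    # Hypothetical record layout, invented for this illustration.
    member_country: str      # country where the loyalty member is registered
    sale_country: str        # country where the revenue was generated
    is_loyalty_member: bool
    amount: float


# Made-up sample data.
SALES = [
    Sale("US", "US", True, 100.0),
    Sale("US", "FR", True, 40.0),    # US loyalty member buying outside the US
    Sale("FR", "US", True, 25.0),    # non-US loyalty member buying in the US
    Sale("US", "US", False, 60.0),   # not a loyalty member
]


def marketing_us_loyalty_revenue(sales):
    """Marketing: all revenue generated by US loyalty members."""
    return sum(
        s.amount
        for s in sales
        if s.is_loyalty_member and s.member_country == "US"
    )


def finance_us_loyalty_revenue(sales):
    """Finance: same as Marketing, but the revenue must also be generated in the US."""
    return sum(
        s.amount
        for s in sales
        if s.is_loyalty_member
        and s.member_country == "US"
        and s.sale_country == "US"
    )


if __name__ == "__main__":
    print("Marketing:", marketing_us_loyalty_revenue(SALES))
    print("Finance:  ", finance_us_loyalty_revenue(SALES))
```

On this sample data the two functions disagree by exactly the revenue that US loyalty members generated outside the US, even though both metrics carry the same name.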

The (partial) solution: the data catalog

As we saw in the library example, the first step in making the data contained in a data lake usable is to provide the equivalent of the library card catalog. In data management this tool (or service) is called a “data catalog”.

A data catalog is an organized service that enables users to explore data sources and understand what they contain. It provides a single source of reference and a simple way for data consumers to access the data they need to perform their jobs. It is the entry point for any data scientist or analyst across the organization.

Important: the data catalog doesn’t contain the data itself; it only contains the metadata describing where the data is.

Data catalogs also provide collaboration features, such as the ability to annotate data assets, which support the data governance process. Indeed, even though a data catalog is a technology tool that can be deployed in a standalone manner, it is most effective as a central part of a data governance effort.

So data catalogs are a critical element of all data lake deployments because they ensure that data sets are tracked, identifiable, governed and managed.

Unfortunately, data catalogs are only part of the solution. Coming back to our previous example, the data catalog makes it possible to describe and classify the term “US loyalty member revenues” in the two departments, but not to resolve the ambiguity.

To resolve data ambiguity, we need to add a further kind of information to our data: the relationships across our data.

This picture shows how information richness grows going from data definitions to classifications and then to relationships.
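
The sketch below illustrates this limitation with a toy catalog in Python. The CatalogEntry fields, the storage locations and the search function are hypothetical and do not correspond to any particular data catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    # Metadata only: the entry points to the data, it does not hold it.
    name: str
    owner: str
    location: str        # hypothetical storage path
    description: str
    tags: tuple = ()


CATALOG = [
    CatalogEntry(
        "US loyalty member revenues", "Marketing",
        "s3://example-lake/marketing/loyalty_revenue.parquet",
        "Revenue generated by US loyalty members",
        ("revenue", "loyalty"),
    ),
    CatalogEntry(
        "US loyalty member revenues", "Finance",
        "s3://example-lake/finance/loyalty_revenue.parquet",
        "Revenue generated in the US by US loyalty members",
        ("revenue", "loyalty"),
    ),
    CatalogEntry(
        "Store locations", "Operations",
        "s3://example-lake/operations/stores.csv",
        "Addresses of physical stores",
        ("stores",),
    ),
]


def search(catalog, keyword):
    """Return every entry whose name, description or tags mention the keyword."""
    kw = keyword.lower()
    return [
        entry
        for entry in catalog
        if kw in entry.name.lower()
        or kw in entry.description.lower()
        or kw in entry.tags
    ]


if __name__ == "__main__":
    for hit in search(CATALOG, "revenue"):
        print(hit.owner, "|", hit.name, "|", hit.location)
```

The search finds both entries and tells the user who owns them and where they live, but nothing in the catalog says which of the two identically named metrics is the one the user actually needs.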

The full solution: the Semantic Layer

The solution to “data ambiguity” is to introduce a component that brings context to the data. This component is called the “semantic layer (SL)”.

“A semantic layer is a business representation of corporate data that helps end-users access data autonomously using common business terms. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization” (From Wikipedia)

The semantic layer is a single business representation of corporate data. It contains a clear and unique definition of corporate data entities.

In a traditional database, a semantic layer is a set of predefined virtual views (models) that represent a company’s data in familiar and meaningful terminology instead of purely technical fields. It is a business translation layer that sits between end users and the database. It insulates end users from the technical details and the complicated structure of the database.

In other words, the semantic layer is an abstraction layer in which corporate entities are defined uniquely, both in terms of meaning and in terms of the rules used to create them. The rules, structures and query languages for creating semantic data elements from a data set are known as semantics, and the data set as a whole is known as a knowledge base.

The picture below shows an example of the mapping between data source tables (right) and business terms (left). The arrows represent the queries that create the business terms.

Example of how a semantic layer is built from data

Important: the semantic layer is just a representation of the data. It is a metadata layer and does not contain any data. The semantic layer contains information about the objects in the data source, which it uses to generate queries to retrieve the data.

So the semantic layer makes it possible to resolve the ambiguity in data meaning.
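
As a minimal sketch of this idea, the Python snippet below stores one business term as pure metadata and generates a SQL query from it. The Metric structure, the sales table and its columns are assumptions made for the illustration, and the term is defined here using the Finance rule purely as an example of a single agreed definition. An in-memory SQLite database stands in for the real data source.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    # Metadata describing how to build a business term; no data is stored here.
    table: str
    expression: str
    filters: tuple = ()


# Hypothetical semantic layer: one agreed definition per business term.
SEMANTIC_LAYER = {
    "US loyalty member revenue": Metric(
        table="sales",
        expression="SUM(amount)",
        filters=(
            "is_loyalty_member = 1",
            "member_country = 'US'",
            "sale_country = 'US'",
        ),
    ),
}


def to_sql(term):
    """Generate the query that retrieves the value of a business term."""
    metric = SEMANTIC_LAYER[term]
    sql = f"SELECT {metric.expression} AS value FROM {metric.table}"
    if metric.filters:
        sql += " WHERE " + " AND ".join(metric.filters)
    return sql


def demo():
    """Run the generated query against a small in-memory table of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (member_country TEXT, sale_country TEXT, "
        "is_loyalty_member INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [("US", "US", 1, 100.0), ("US", "FR", 1, 40.0),
         ("FR", "US", 1, 25.0), ("US", "US", 0, 60.0)],
    )
    value = conn.execute(to_sql("US loyalty member revenue")).fetchone()[0]
    conn.close()
    return value


if __name__ == "__main__":
    print(to_sql("US loyalty member revenue"))
    print(demo())
```

The user asks for a business term; the layer, not the user, knows which table, expression and filters stand behind it.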

Which semantic layer for data lakes?

The idea of a semantic layer is not new. The original concept was created in the ’90s by Business Objects and adopted by other BI tool vendors over the next 20 years.

Traditionally, these semantic layers sit on top of a traditional data warehouse or database in order to make it easier for the business to create its reports. Semantic layers are the main entry point for data access for most business users when they are creating reports, dashboards, or running ad hoc queries. They have always been purpose-built for specific BI visualization tools; indeed, all of them are vendor-dependent.

Traditional DWH and Semantic layer architecture

The main characteristics of traditional semantic layer solutions are:

  • Data pre-integration across multiple data sources
  • Joins and relationships are all handled in the data model
  • Columns can be renamed into business user-friendly names
  • Business logic and calculations are centralized in the data model
  • Time-oriented calculations are included in the data model
  • Aggregation behavior is set to improve the response time of reporting tools
  • Formatting is pre-specified to improve the response time of reporting tools
  • Data security is incorporated
  • Data can be enhanced by adding hierarchies, calculated measures and calculated members to improve the response time of reporting tools
  • Vendor-specific solution (the semantic layer works only with its own BI tool)

These characteristics are not sufficient for data lakes and must be complemented with additional ones. A semantic layer for data lakes must:

  • be independent of the visualization engine and of any vendor-specific tool, because it has to be general purpose and cannot be locked to a specific vendor
  • support a large number of file formats, not only tables and structured data, as data lakes contain corporate data in both structured and unstructured formats
  • be capable of linking disparate data, both structured and unstructured
  • be general purpose, with a focus on knowledge management and sharing across the enterprise
  • support both schema-less and schema-based data models
  • support semantic inference to create additional knowledge from the data
  • support semantic graph visualization to make it easier to identify relationships across data
  • support version control and governance, as the semantic layer is the new center of the data governance process
  • have an open SDK so that the semantic layer can be consumed from any client tool
  • support the automatic creation of ontologies and data classifications

In the case of data lakes, due to the huge volume of data, it is extremely important that semantic layer tools provide automatic functionality to build ontologies.

Next generation semantic layer

As we saw in the previous paragraph, data lakes need a new kind of semantic layer. The table below provides a quick view of the main characteristics of the new semantic layer in comparison with traditional ones.

| Traditional semantic layer | New generation semantic layer |
| --- | --- |
| Vendor-specific solution (the semantic layer works only with its own BI tool) | Independent of the visualization engine and of any vendor-specific tool |
| Supports only schema-based data models | Supports both schema-less and schema-based data models |
| Data model based | Data model and ontology based |
| Business logic and relationships are all handled in the data model | Business logic and relationships can be handled in the data model and in the RDF (Resource Description Framework) model |
| Reporting and dashboard oriented | General purpose. Its main focus is on knowledge management and sharing |
| Data security is incorporated | Data security is incorporated |
| Data pre-integration across multiple data sources | Physical and virtual data models across multiple data sources |
| Structured data only | Support for both structured and unstructured data |
| Database format mainly | Support for a large number of file formats, not only tables and structured data |
| SQL | SQL, SPARQL, JSON, ... |
| No or limited SDK and API | Open SDK to consume the semantic layer from any client tool |
| Supports version control and governance | Supports version control and governance |
| No support for semantic inference | Supports semantic inference (new relationships can be created by inference) |
| No support for semantic graph visualization | Supports semantic graph visualization |
| Manual or limited automatic creation of semantic model(s) | Manual, semiautomatic and automatic creation of semantic model(s) |
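
To illustrate what “new relationships can be created by inference” means, here is a minimal sketch in plain Python. It stores facts as (subject, predicate, object) triples, in the spirit of RDF, and applies a single rule: the located_in relationship is transitive. The facts, the predicate names and the rule are invented for the example; a real semantic layer would typically rely on an RDF store and a reasoner rather than on hand-written code like this.

```python
# Made-up facts, stored as (subject, predicate, object) triples.
TRIPLES = [
    ("Store42", "located_in", "Austin"),
    ("Austin", "located_in", "Texas"),
    ("Texas", "located_in", "US"),
    ("Store42", "sells", "Shoes"),
]


def infer_transitive(triples, predicate):
    """Add (a, predicate, c) whenever (a, predicate, b) and (b, predicate, c) hold.

    The rule is applied repeatedly until no new fact appears.
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for (s, p, o) in facts if p == predicate]
        for a, b in links:
            for c, d in links:
                if b == c and (a, predicate, d) not in facts:
                    facts.add((a, predicate, d))
                    changed = True
    return facts


if __name__ == "__main__":
    inferred = infer_transitive(TRIPLES, "located_in")
    for fact in sorted(inferred - set(TRIPLES)):
        print("inferred:", fact)
```

None of the original facts states that Store42 is in the US, yet the relationship follows from the ones that are stated. This is the kind of hidden fact that an inference-capable semantic layer can surface.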

Conclusions

To solve their usability issue, data lakes must be provided with one or more semantic layers that take an inventory of all the key business metrics and collect them in a single abstraction layer, where they can be managed and changed in one place. Creating a semantic layer is a necessity if an enterprise wants to unleash the power of big data and analytics across siloed organizations. It provides:

  • the ability to navigate across the information in the company
  • a way to see important records and to understand how they relate to other relevant information within their own contexts
  • a way to discover hidden facts or create new relationships across the data using inference

However, a semantic layer is far from trivial to construct. It takes a strong organizational commitment to build one, but it is worth the effort.
