The Data Lakehouse is an exciting, modern open architecture for data management and business analytics that lets you use all your data, combining the benefits of a data warehouse with those of a data lake. The term and the idea have been around for some time, but they have gained serious traction in the past couple of years, when key technology advancements - metadata layers for data lakes, new query engine designs and optimized access for data science and machine learning tools - enabled the Data Lakehouse to move from the realm of theoretical research into general business use.
The background
The world we live in is more connected and automated than ever, generating a myriad of data sources and a seemingly endless amount of data. To drive new business value, companies need to tap into and exploit these new data wellsprings - video, sound, images, sensor data, media feeds and more - and they need to be able to do this at scale. The goal is to collect the data, bring it together, make it available, analyse it and answer the new questions that keep popping up.
Back when big data was mostly structured, data warehouses did the job just fine, generating sufficient business insight for decision-makers to stay in the game. A data warehouse is typically built for BI and reporting purposes, extracting data from source systems such as transactional systems, operational databases and line-of-business applications. It enforces data quality and consistency standards and delivers data in a presentation format that can be used for making decisions. While a data warehouse provides easy data discovery and querying with straightforward data preparation, it is not a very cost-effective way to store and analyze semi-structured or unstructured data.
With the emerging need to derive intelligence from unstructured data, the ability to handle these types of data became business-critical - and data lakes are all about storing and managing semi-structured and unstructured data at low cost and at scale. A data lake is great for streaming and complex data processing, storing data in a single location in an open, ready-to-read format. However, it takes time for that data to become queryable with its quality and reliability ensured. Organizations that implement data lakes can eventually end up with a data swamp - a dumping ground of data without clear organisation.
The majority of companies today utilize a two-tier architecture with the data warehouse and the data lake running side by side, but these are in effect separate information silos, disconnected and served by different tools. In a two-tier data architecture, data is first captured in the operational databases and then transferred using ETL (Extract, Transform, Load) processes into a data lake, which stores the data from the entire enterprise in low-cost object storage. Data is stored in a format compatible with common machine learning tools but is often not organized and maintained well. Then, a small part of the critical business data is ETL'd once again into the data warehouse for business intelligence and data analytics.
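The double ETL hop described above can be illustrated with a toy sketch. All names and figures here are hypothetical: operational records are first landed in a "lake" as open-format files, and then a small curated slice is extracted a second time into a "warehouse" store for BI.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical names throughout; a toy illustration of the two-tier flow.
lake = Path(tempfile.mkdtemp(prefix="lake_"))
warehouse = {}  # stands in for the warehouse's separate, proprietary store

# ETL hop 1: land raw operational records in the lake as open-format files.
raw_orders = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 40.0},
]
(lake / "orders.jsonl").write_text("\n".join(json.dumps(r) for r in raw_orders))

# ETL hop 2: re-extract a small, curated slice into the warehouse for BI.
revenue_by_region = {}
for line in (lake / "orders.jsonl").read_text().splitlines():
    rec = json.loads(line)
    region = rec["region"]
    revenue_by_region[region] = revenue_by_region.get(region, 0.0) + rec["amount"]
warehouse["revenue_by_region"] = revenue_by_region

print(warehouse["revenue_by_region"])  # {'EU': 160.0, 'US': 80.0}
```

Note that the same order data now lives in two places, kept in sync only by the second ETL process - exactly the duplication and staleness risk discussed below.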
This kind of architecture requires regular maintenance and often leads to data staleness. Running multiple systems introduces complexity, security challenges and higher infrastructure, maintenance and operational costs - and, even more importantly, it can delay your access to timely data insights.
The best of both worlds?
The Data Lakehouse, as the name already suggests, is a new data architecture that combines a data lake and a data warehouse in a single data environment. The effect of this coupling is two-fold: it addresses the limitations and leverages the advantages of both. What we end up with is low-cost storage for large volumes of raw data, combined with the features that let us structure, manage and optimise that data.
The Data Lakehouse eliminates the need to run both a data lake and a data warehouse, improving data quality and reducing data redundancy; within it, ETL processes can still refine data from the unsorted lake layer into an integrated, warehouse-like layer. It acts as a cloud-native, centralised datastore that can absorb all types of data in raw formats and lets users query that data directly, without upfront ETL or data movement into a separate system. Data discovery, metadata management and data governance capabilities are supported, while access controls and security rules ensure data security and integrity.
The main features
As the famous rhyme advises, something new, something old, something borrowed and something blue is the perfect wedding recipe for a happy, long-lasting marriage. What are the features that the data warehouse and data lake brought to the union, and what are the fruits of this bond?
- Transaction support: Data Lakehouse can handle multiple data pipelines, enabling multiple users and processes to concurrently read or write data while maintaining data integrity, and is ACID compliant (transactions are atomic, consistent, isolated, and durable)
- Schema enforcement and governance: a data warehouse applies a schema to all data, while a data lake doesn't. The lakehouse can enforce and evolve schemas, bringing warehouse-style structure and governance to a much greater volume of data.
- Business Intelligence (BI) and analytics support: Data Lakehouse is a single data repository, removing the requirement for two copies of the data, speeding up analytics and improving the quality of BI.
- Support for a broad range of data types: from structured to unstructured, data lakehouse can store and provide access to diverse data types such as images, video, audio, and system logs.
- Decoupling of storage and computing: Data Lakehouse separates processing from data storage and uses clusters in a cloud to maximise available resources, lower costs and increase scalability.
- Openness: Data Lakehouse uses standardised and open storage formats, enabling access to a wide range of tools and engines that can work natively with the data.
- End-to-end streaming: Data Lakehouse supports near real-time ingestion of streaming data, thus enabling streaming analytics and facilitating real-time reporting, which is rapidly becoming a must-have capability for a number of enterprises.
- Machine Learning governance: Data Lakehouse allows for the versioning of data, and data scientists can use this to repeat and compare model builds as an increasing number of organisations start to leverage Machine Learning (ML) at scale.
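To make the schema enforcement feature above concrete, here is a minimal sketch of schema-on-write validation - one of the warehouse-style guarantees a lakehouse metadata layer adds on top of cheap open storage. All names and records here are hypothetical.

```python
# Toy schema-on-write check: rows are validated before they reach the table,
# instead of problems surfacing at read time (schema-on-read).
SCHEMA = {"event_id": int, "user": str, "duration_ms": int}

def conforms(record: dict) -> bool:
    """Accept a record only if it has exactly the declared columns and types."""
    return (
        record.keys() == SCHEMA.keys()
        and all(isinstance(record[col], typ) for col, typ in SCHEMA.items())
    )

table = []     # stands in for a governed lakehouse table
rejected = []  # nonconforming rows are quarantined, not silently dropped

incoming = [
    {"event_id": 1, "user": "ana", "duration_ms": 230},
    {"event_id": "2", "user": "ben", "duration_ms": 115},  # wrong type
    {"event_id": 3, "user": "cy"},                         # missing column
]
for rec in incoming:
    (table if conforms(rec) else rejected).append(rec)

print(len(table), len(rejected))  # 1 2
```

Quarantining bad rows rather than dropping them keeps the governed table clean while preserving the raw input for later inspection - the kind of balance between lake flexibility and warehouse discipline the list above describes.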
The problems solved
Data warehouses and data lakes exist side-by-side in many companies without any major issues, but the Data Lakehouse can help address several challenges that commonly arise with that two-tier setup:
- Data reliability: with multiple copies of data to keep in sync, data consistency is extremely hard to achieve. There are multiple ETL processes and each additional process introduces added complexity, delays and failure modes. By eliminating one tier, the data lakehouse architecture removes one of the ETL processes, while adding support for schema enforcement and evolution.
- Data duplication: when an organization has both a data lake and several data warehouses, this will create redundancies that are inefficient and may lead to data inconsistencies. The Data Lakehouse provides a unified, single source of truth for the organization.
- High storage costs: both the data warehouse and the data lake can help reduce storage costs - the warehouse by reducing redundancy and integrating disparate sources, the lake by storing data on cheap hardware. The Data Lakehouse combines these techniques by design, creating a highly cost-effective way to store data.
- BI and analytics silos: Business analysts mostly use warehouses as an integrated data source while data scientists work with lakes, using analytics techniques to navigate the unsorted data. Their work can often overlap or even contradict each other, but with a Data Lakehouse, both teams are working from the same repository.
- Limited support for advanced analytics: Advanced analytics, including machine learning and predictive analytics, often requires processing very large datasets. Common tools make it easy to read raw data lake files in open data formats, but most of them cannot read the proprietary data formats used by data warehouses. In the Data Lakehouse, these common toolsets can operate directly on high-quality, timely data stored in the data lake.
- Performance for business intelligence: the high performance required for BI and decision support was often the reason companies maintained a data warehouse in addition to a data lake. The Data Lakehouse provides support for indexing, locality controls, query optimization and hot-data caching to improve performance.
- Data staleness: Because the data warehouse is populated from the data lake, it is often stale, forcing the majority of analysts to use out-of-date data. While eliminating the data warehouse tier solves this problem, a Data Lakehouse can also support efficient, easy and reliable merging of real-time streaming with batch processing, ensuring the most up-to-date data is being used.
- Data stagnation: Stagnation is a major problem in a data lake, which can quickly become a data swamp. Organisations tend to dump their data into a lake without properly cataloguing it, making it hard to know if the data has expired. The Data Lakehouse structure brings greater organization and helps identify surplus data.
- Risk of future incompatibility: Data analytics is still an emerging field, with new tools and techniques appearing every year. Some of these might only be compatible with data lakes, while others might only work with warehouses. The flexible Data Lakehouse structure prepares companies for the future either way.
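The Machine Learning governance point mentioned earlier - data versioning - is worth a small illustration, since it also addresses the reliability and staleness issues in the list above. The sketch below is a hypothetical, stdlib-only model of "time travel": each write commits an immutable snapshot, so any past version of a table can be re-read, for example to reproduce the exact training set an earlier model was built on.

```python
# Toy versioned table: every write appends a new immutable snapshot,
# and readers can ask for the latest version or any historical one.
class VersionedTable:
    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def write(self, rows):
        """Commit a new version: the previous rows plus the new ones."""
        self._versions.append(self._versions[-1] + list(rows))
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Read the latest version, or the table as of a given version."""
        idx = -1 if version is None else version
        return list(self._versions[idx])

t = VersionedTable()
v1 = t.write([{"feature": 0.1}, {"feature": 0.4}])  # training set for model 1
v2 = t.write([{"feature": 0.9}])                    # later data arrives

print(len(t.read()))    # 3 - the latest state
print(len(t.read(v1)))  # 2 - exactly what model 1 was trained on
```

Real lakehouse table formats implement this far more efficiently with transaction logs and shared file references rather than full copies, but the contract is the same: past versions stay readable, which is what makes model builds repeatable and comparable.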
In this blog, we have touched upon the background and main features of the Data Lakehouse and the challenges it can help you address. Stay tuned for the next post in this series, in which we will look more closely at the Oracle Data Lakehouse and how your company can leverage it to drive new business value.
Want to find out more about the Data Lakehouse?
Improve your business performance and gain a competitive advantage by modernizing your data management and analytics with a Data Lakehouse.
Register now for our free webinar: The Oracle Data Lakehouse. Everything you need to know now.