Data Warehouses VS Data Lakes

Huzefa Khan
4 min readJan 2, 2021

--

Data Warehouses

First appearing in the 1980s, data warehouses evolved as big data emerged and became one of the first solutions to accommodate its needs. Before the data warehouse, companies were storing data in lots of different systems with no way to unite the disparate databases. For analysts, a disparate system can make it impossible to synthesize a total view on which to base actionable business insights.

Data warehouses brought the collection of databases all under a single umbrella and allowed the data to be queried and viewed as a whole.

Advantages

In the late 1990’s, data warehouses were the most dominant data architecture for big companies. The primary advantages include:

  • Standard structured query language (SQL) for access
  • Integration of many data sources
  • Data optimized for fast reads
  • Ability to run quick ad-hoc analytical queries

These advantages enable data analysts to use an intuitive and powerful query language, SQL, to easily access data from a variety of sources. Also, the data is stored such that it can be delivered quickly to show results of analytical queries.

Challenges

As the speed and scale of data really exploded though, some deficiencies became clear:

  • Inability to store unstructured, raw data
  • Difficult to scale
  • Requires significant investment in a proprietary system

All of these add up to bottlenecks and roadblocks for data analysts.

Since warehouses are unable to store unstructured, raw data, data teams who collect both are automatically resigned to work within different systems. As big data gets bigger, it continues to push the limits of a legacy data warehouse. Scaling up requires engineers dedicated strictly to managing the infrastructure. More time on systems engineering translates into delays in analytics.

Once a company builds expensive, proprietary hardware and software into their system, it can be almost impossible to leave it.

Now, there are some cloud-based data warehouse solutions that have been built to address some of these challenges, but still — we are missing support for unstructured data and streaming data. As a result, many organizations employ data warehouses only for smaller subsets of their data, as in the diagram below. You can see that much of their data is flowing into data storage outside of the data warehouse, and only subsets are available at the data warehouse level.

Data Lakes

Data Lakes are newer to the landscape, emerging around 2010 and developing over the past decade, to answer some of the problems organizations were facing with their data warehouses. As big data continues to evolve, the data we collect is increasingly unstructured, fast-moving, and high-volume.

Data lakes are often used to consolidate all of an organization’s data in a single, central location, where it can be saved “as is,” without the need to impose a schema or structure on it up front. Data in all stages of the refinement process can be stored in a data lake: raw data can be ingested and stored right alongside an organization’s structured, tabular data sources (like database tables), as well as intermediate data tables generated in the process of refining raw data. Unlike most databases, data lakes can process all data types including images, video, audio and text.

Today, companies have lots of data, but it’s often isolated and siloed away in different storage systems: data warehouses, databases, and other storage systems across the enterprise. A data lake breaks down these data silos, centralizing and consolidating all of your organization’s data assets into a complete and authoritative data store for analytics that is always up to date. Unifying all data in a data lake is the first step for companies that aspire to harness the power of machine learning and data analytics to win in the next decade.

A data lake’s flexible, unified architecture opens up a wide range of new use cases for cross-functional enterprise scale analytics, BI, and machine learning projects that can unlock massive business value. Data analysts can harvest rich insights by querying the data lake using SQL, data scientists can join and enrich data sets to generate ML models with ever greater accuracy, data engineers can build automated ETL pipelines, and business intelligence analysts can create visual dashboards and reporting tools faster and easier than before. These use cases can all be performed on the data lake simultaneously, without lifting and shifting the data, even while new data is streaming in.

Advantages

  • Can hold all of an organization’s data: structured, unstructured, and semi-structured
  • Centralizes data for access by the whole data team
  • Storage is relatively inexpensive, and the quantity of storage can be easily increased for scalability

Challenges

  • Data reliability — without proper tools, it can be difficult to maintain data lakes which can threaten the veracity of an organizations data.
  • Query performance — in many systems, as data lakes grow larger, query performance suffers. This can be due to a variety of factors including slowdowns around metadata management and improper data partitioning.

Use this chart to help compare Data lakes and Data Warehouses

Reference: https://docs.databricks.com/data/index.html

--

--

Huzefa Khan
Huzefa Khan

Written by Huzefa Khan

Passionate Sr. Data Engineer with years of experience in developing and architecting high-class data solutions https://www.linkedin.com/in/huzzefakhan/

No responses yet