What Makes Up a Good Data Lake?

Huzefa Khan
6 min read · Mar 20, 2023

A data lake is a place for your business or enterprise to store and collect data. The data you store in your data lake may be structured or unstructured, meaning it may or may not have a defined schema.

Our objective in creating a data lake is to establish a centralized repository that can house all of our business-related data. This would enable us to conduct analytical studies on the data at some point in the future. The data in question could be derived from a range of sources, such as our transactional systems, line of business applications, IoT devices, mobile applications, and even social media.

Companies that are able to aggregate their data, work with it, and derive meaning from it will be able to outperform their peers. These companies might do so through the use of generic data analytics or even by using machine learning to provide valuable insights.

An effective data lake should address five primary challenges:

  1. Providing ample and efficient storage for the data within the lake itself.
  2. Facilitating efficient data movement to the lake from various sources.
  3. Enabling effective data cataloging and discovery to aid in the identification and classification of relevant data.
  4. Enabling the extraction of valuable insights through generic analytics.
  5. Leveraging the available data to enable predictive analytics, which can help make informed decisions about the future based on past trends and patterns.

Storage

When it comes to storing vast amounts of raw data, AWS offers a service built to handle such volumes: S3. It excels in this area thanks to its ability to store unstructured data of all kinds, such as log files, JSON transaction documents, and blobs of binary output, which makes it a much more suitable option than a traditional database.

Another significant advantage of using S3 is its support for lifecycle policies. By implementing these policies, we can effectively manage the cost of storing ever-increasing amounts of data. For instance, we can move infrequently accessed data to a cheaper storage tier, and eventually transfer it to Glacier, AWS’s archival storage service, if we determine that the data is unlikely to be accessed anytime soon. If we ever need to retrieve such data, we can move it back into S3 Standard.
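
As a rough sketch of what such a policy might look like with boto3 (the bucket name, prefix, and day thresholds below are placeholders, not recommendations):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and thresholds -- tune these to your own access patterns.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    # Move infrequently accessed objects to a cheaper tier after 30 days...
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # ...and into Glacier Deep Archive after a year.
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```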

Data Movement

Another important thing to figure out when building your data lake is how, exactly, you plan to get your data into it. We know that S3 is what we should use for storage, but what mechanisms do we want to use to get all of that data into S3?

We could of course manually move large folders of archived log data into whatever bucket we are using for our data lake, but that idea is not super scalable and honestly just feels bad. It would be great to automatically push our business data into this bucket.

There are a few ways of getting your data into your bucket: actively streaming data with Kinesis, using Direct Connect to bring large quantities of data in from on premises, using the Database Migration Service to move your database contents into S3, or even having Snowball devices delivered to some faraway outpost once a month to collect research data and ship it back to AWS. Whatever your method, you will need a way to move your data into AWS, and you will prefer that it be automated.
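
As one small example of automating that push, the snippet below assumes a Kinesis Data Firehose delivery stream (the stream name and record shape are placeholders) has already been configured to buffer records and land them in the data lake bucket:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical business event; Firehose batches these and writes them to S3.
record = {"order_id": 123, "total": 49.99, "ts": "2023-03-20T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="business-events-to-data-lake",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```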

Data Cataloging and Discovery

Data cataloging and discovery are critical components of a data lake. Without proper management and organization, it can quickly become a data swamp, making it challenging to find specific data sets. To prevent this, it is essential to create metadata, or data about the data. This metadata includes information about data formats, compression, sensitivity, and any additional tags that would be helpful to know.

There are many ways to create a data catalog. One approach is to set up an upload event on your S3 bucket, which triggers a Lambda function to store metadata information in DynamoDB.
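
A minimal sketch of that Lambda function might look like the following, assuming a hypothetical DynamoDB table called data-lake-catalog keyed on the object key:

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")
# Hypothetical table name and schema -- adjust to your own catalog design.
table = boto3.resource("dynamodb").Table("data-lake-catalog")

def handler(event, context):
    """Triggered by s3:ObjectCreated:* events; records basic metadata per object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        head = s3.head_object(Bucket=bucket, Key=key)
        table.put_item(
            Item={
                "object_key": key,
                "bucket": bucket,
                "size_bytes": head["ContentLength"],
                "content_type": head.get("ContentType", "unknown"),
                "last_modified": head["LastModified"].isoformat(),
            }
        )
```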

A better option is to use AWS Glue, a managed transform engine that allows you to run ETL pipelines and includes a robust data catalog with built-in crawlers. These crawlers can crawl through various data sources, such as S3 buckets, databases, and data warehouses, to automatically populate the catalog. This eliminates the need for manual intervention and ensures that the catalog is always up to date.
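
If you were wiring this up with boto3, creating and starting a crawler might look roughly like this (the crawler name, role ARN, database, path, and schedule are all placeholders):

```python
import boto3

glue = boto3.client("glue")

# The IAM role must allow Glue to read the S3 path it crawls.
glue.create_crawler(
    Name="data-lake-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="data_lake_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/"}]},
    # Re-crawl nightly so the catalog stays current.
    Schedule="cron(0 2 * * ? *)",
)

glue.start_crawler(Name="data-lake-raw-crawler")
```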

Generic Analytics

Collecting data is essential for gaining insights into the performance of our business in the past. The information contained in our data can help us identify what strategies were successful and which ones failed.

AWS provides a range of powerful services that can help us make sense of our data. These services vary in their analytical capabilities and objectives.

Amazon Kinesis or Amazon MSK

If you want real-time information about your data lake, or at least about the data being streamed into it from Kinesis or Amazon MSK, you can use Kinesis Data Analytics to get a real-time view of what your streaming data is up to.

Amazon Athena

If you are looking to interactively query your data, there is Amazon Athena, a purpose-built service that makes it easy to analyze data in Amazon S3 using standard SQL.
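
For example, kicking off an Athena query from boto3 might look like the sketch below; the database, table, and results location are assumptions, not values from this article:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT order_date, SUM(total) AS revenue
        FROM orders
        GROUP BY order_date
        ORDER BY order_date
    """,
    # Database created by the Glue crawler; results land back in the bucket.
    QueryExecutionContext={"Database": "data_lake_catalog"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
)

print("Query execution id:", response["QueryExecutionId"])
```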

Amazon QuickSight

If you have some section of your data that you want to create dashboards and graphs for, that’s where something like Amazon QuickSight can be added to your solution.

Amazon Redshift

We also have data warehousing services like Amazon Redshift, into which you can load a subset of your data lake to perform general analytics and try to derive some meaning from that data.
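
One way to do that is to load curated data lake files into Redshift with a COPY statement, here issued through the Redshift Data API; the cluster, database, role, and table names below are placeholders:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# COPY pulls Parquet files from the data lake bucket into a Redshift table.
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="admin",
    Sql="""
        COPY orders
        FROM 's3://my-data-lake-bucket/curated/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """,
)
```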

Predictive Analytics

Predictive analytics can provide valuable insights into a business’s future by analyzing its data, and creating machine learning systems is key to achieving this goal. To do so effectively, it is important to have a robust dataset, which is where a data lake comes in handy. Amazon SageMaker is an excellent tool for quickly developing, training, and deploying machine learning models within the AWS ecosystem. Furthermore, AWS offers a range of deep learning AMIs pre-configured with popular deep learning frameworks and interfaces like TensorFlow, PyTorch, Apache MXNet, Chainer, Gluon, Horovod, and Keras. These AMIs don’t incur any additional charges beyond the usual pay-as-you-go instance fees.
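
As a rough illustration of that workflow using the SageMaker Python SDK (the IAM role, S3 paths, and hyperparameters below are placeholders), you might train the built-in XGBoost algorithm directly against curated data sitting in the lake:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Placeholder execution role; it needs access to the training data and output path.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Built-in XGBoost container used here purely as an example algorithm.
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-data-lake-bucket/models/",
    sagemaker_session=session,
    hyperparameters={"objective": "reg:squarederror", "num_round": 100},
)

# Train on curated CSV data from the data lake.
estimator.fit(
    {"train": TrainingInput("s3://my-data-lake-bucket/curated/training/", content_type="text/csv")}
)
```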
