AWS Glue Data Quality
Hello and welcome to this article, where I’ll be discussing AWS Glue Data Quality.
Keeping data healthy is a real challenge these days, when everyone is talking about data and AI.
Maintaining data integrity continues to be a hurdle for businesses. Reviewing recent and historical data for potential errors takes considerable time and effort, and any incorrect or misformatted data can severely damage your data pipeline. These issues can cause business problems such as production failures, skewed data distributions, and unreliable data statistics, significantly affecting business operations.
To address this challenge, AWS introduced Glue Data Quality, which analyzes and monitors data to help you maintain data integrity in your AWS Glue Data Catalog and Glue ETL jobs. It is built on Deequ, an open-source framework developed by Amazon that makes it easy to calculate data quality metrics.
Glue Data Quality also offers the inherent advantages of an AWS service: it is completely serverless and runs on AWS's infrastructure, it is accessible through the Glue console, and it uses pay-as-you-go pricing.
So now that we know what the service does, let’s talk about how you use it.
You can use Glue Data Quality with both the Glue Data Catalog and with Glue ETL Jobs. The use case and features for each of these are slightly different.
Glue Data Catalog
Use Glue Data Quality with the Data Catalog when looking for quality issues with data that already exists in a data lake or dataset.
Glue ETL Jobs
And you’d use Glue Data Quality with Glue ETL Jobs to proactively filter out bad data before it gets into your data lake or data catalog.
Once you pick your use case, the bulk of the work is largely the same. There are four main steps of using Glue Data Quality:
- Writing data quality rules with DQDL
- Rule Evaluation
- Setting up alerts and orchestration
- Visualizing Data Quality scores and metrics
Writing data quality rules with DQDL
To find data quality issues, you have to write your own set of rules to identify them. To write these rules, you’ll use a domain-specific language created by AWS called the Data Quality Definition Language or DQDL.
Every rule you write using DQDL starts with a rule type. When you select the Data Quality feature, the console presents you with the DQDL editor: the left side of the screen lists all the available rule types, and the right side shows an empty rule set that you can edit to write your rules.
You can add rules to this editor by clicking on the rule type you want to use. For example, one rule type is IsUnique, which checks whether all the values in a column are unique. If I click on the IsUnique rule type, it populates the rule type along with a placeholder column called "column A". I can change this column to better fit my data; for example, I can point it at an "id" column instead to check whether all values in my id column are unique.
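To make that concrete, here's what a small ruleset might look like in the editor. The column names ("id", "order_date", "amount") are placeholders for your own schema.

```
Rules = [
    IsUnique "id",
    IsComplete "order_date",
    ColumnValues "amount" > 0
]
```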
As you can see, there are a lot of different rule types available to you. If you're looking to create your own rules, the DQDL reference in the AWS documentation helps you better understand each rule type.
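If you're taking the Data Catalog path and prefer to work programmatically rather than through the console, the same DQDL ruleset can also be created and evaluated with the AWS SDK. The sketch below uses boto3; the database, table, IAM role, and ruleset names are placeholders, not values from any real setup.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- replace with your own database, table, and IAM role.
ruleset_name = "orders_quality_rules"

# Register a DQDL ruleset against a Data Catalog table.
glue.create_data_quality_ruleset(
    Name=ruleset_name,
    Ruleset='Rules = [ IsUnique "id", IsComplete "order_date" ]',
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)

# Kick off an evaluation run of that ruleset against the table.
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
    RulesetNames=[ruleset_name],
)
print(run["RunId"])
```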
Rule Evaluation
Once you write your rules, you can then define what happens when data quality issues are detected. You have several output options. The first is to output the original input data; if you select this option, you can also optionally add new columns to your data that flag data quality issues.
The second option is to output the results of your rules, to see the pass/fail status of each.
The third option is to publish results to CloudWatch. This is turned on by default, but you can turn it off if you'd like.
And then finally, you can specify what action you’d like to take if your Data Quality Rules fail. If they fail, you can either continue with the job, fail the job after loading the target data, or fail the job without loading the target data.
Once you define these actions, you can then save your Job. Finally, you can run the job and see details about the Job run in the Glue dashboard and in CloudWatch.
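If you open the job's generated script instead of using the visual editor, those same choices show up as parameters on the Evaluate Data Quality transform. The sketch below is modeled on the script Glue Studio generates; the `awsgluedq` module path, the publishing option names, and the "rowLevelOutcomes"/"ruleOutcomes" keys are assumptions to verify against the script your own job produces, and the catalog names are placeholders.

```python
# Sketch of the script-level equivalent of the Evaluate Data Quality transform.
# Assumes this runs inside a Glue ETL job, where the awsglue libraries are available.
from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection
from pyspark.context import SparkContext
from awsgluedq.transforms import EvaluateDataQuality  # assumed module path

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database and table.
input_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

ruleset = """Rules = [ IsUnique "id", IsComplete "order_date" ]"""

dq_results = EvaluateDataQuality().process_rows(
    frame=input_dyf,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_dq_check",
        "enableDataQualityCloudWatchMetrics": True,   # "Publish results to CloudWatch"
        "enableDataQualityResultsPublishing": True,
    },
)

# Original input rows, with extra columns flagging rows that failed rules.
row_level = SelectFromCollection.apply(dfc=dq_results, key="rowLevelOutcomes")

# One record per rule with its pass/fail outcome.
rule_outcomes = SelectFromCollection.apply(dfc=dq_results, key="ruleOutcomes")
```

The on-failure actions (continue, fail after loading, fail without loading) are still configured on the transform itself, so the script only needs to decide what to do with the two output frames.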
Setting up alerts and orchestration
You can also integrate it with other AWS services to stay on top of new data issues in your data pipelines. For example, if you want to create an event-driven pipeline with data quality in mind, you can use Amazon EventBridge to receive the events emitted after your data quality rulesets run. You can then filter events based on success, failure, and other parameters, meaning you can set up alerts for when your data quality rules fail. You can even format notifications into emails or text messages by having EventBridge invoke an AWS Lambda function that sends Amazon SNS notifications, ensuring that the appropriate people are notified of the failure.
Keep in mind if you do want to use Amazon EventBridge with Glue Data Quality, you have to ensure that you’re keeping “Publish results to CloudWatch” selected.
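As a concrete sketch of that event-driven setup, the snippet below creates an EventBridge rule that matches Glue Data Quality result events and forwards them to an SNS topic (a simplified version of the EventBridge-to-Lambda-to-SNS flow described above). The event source and detail-type strings, and the topic ARN, are assumptions to check against the current Glue documentation.

```python
import json
import boto3

events = boto3.client("events")

# Match events emitted when a Glue Data Quality evaluation finishes.
# The source/detail-type values below are assumptions -- confirm them in the Glue docs.
event_pattern = {
    "source": ["aws.glue-dataquality"],
    "detail-type": ["Data Quality Evaluation Results Available"],
}

events.put_rule(
    Name="glue-dq-results",
    EventPattern=json.dumps(event_pattern),
)

# Forward matched events to a (hypothetical) SNS topic that emails the data team.
# You could equally target a Lambda function that formats the message first.
events.put_targets(
    Rule="glue-dq-results",
    Targets=[{
        "Id": "notify-data-team",
        "Arn": "arn:aws:sns:us-east-1:123456789012:data-quality-alerts",
    }],
)
```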
Additionally, if you prefer to set up schedules for your rules to run, you can do that as well. One way is to use Amazon EventBridge and AWS Step Functions to orchestrate Glue Data Quality evaluations. The other is to use the Run Frequency feature built into Glue Data Quality to specify the cadence you'd like your rules to run on; this option is only available if you're using Data Quality with the Glue Data Catalog.
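If you go the EventBridge-plus-Step-Functions route, a minimal sketch (with placeholder ARNs, and assuming you've already built a state machine that starts the evaluation) looks like this: a rule on a rate expression that triggers the state machine once a day.

```python
import boto3

events = boto3.client("events")

# Run the data quality evaluation workflow once a day.
events.put_rule(
    Name="daily-dq-evaluation",
    ScheduleExpression="rate(1 day)",
)

# Target a (hypothetical) Step Functions state machine that starts the Glue
# Data Quality evaluation; EventBridge needs an IAM role allowed to start it.
events.put_targets(
    Rule="daily-dq-evaluation",
    Targets=[{
        "Id": "start-dq-state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:GlueDQEvaluation",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
```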
Visualizing Data Quality scores and metrics
Finally, the last thing to mention is that you can integrate AWS Glue Data Quality with Amazon Athena and Amazon QuickSight to better visualize your Data Quality results. You can do this by exporting Glue Data Quality evaluation results to an Amazon S3 bucket. From there, you can use Amazon Athena to run queries against those results, and you can use Amazon QuickSight to build dashboards that represent data quality metrics and trends.
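As a hedged example of that last step: once the results are exported to S3 and a table has been defined over that prefix (for example, with a Glue crawler), you can query them from Athena. The database, table, and column names below are hypothetical, since the exact exported schema depends on your setup.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical table defined over the S3 prefix where DQ results are exported.
query = """
SELECT *
FROM data_quality_results
ORDER BY evaluation_time DESC   -- hypothetical column name
LIMIT 20
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "data_quality_db"},          # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])
```

The same table can then be registered as a QuickSight dataset to chart scores and rule outcomes over time.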
Overall, Glue Data Quality provides an easy-to-use framework to manage data quality effectively, whether you’re dealing with incoming data or maintaining existing datasets.