Data Quality Report

Huzefa Khan
Feb 1, 2021

The purpose of this report is to log key findings from the “Data Exploration and Pre-Processing” exercise conducted as part of the CRISP-DM process. Ideally, a separate report should be generated for each individual data file.

The report covers key parameters related to the data sources, metadata, data inconsistencies, data summary, statistical tests conducted, transformations performed, and visualization techniques employed to describe and analyze the data.

Data Sources and Feature Engineering

Provide a brief description of the supplied data before processing, covering:

1. the raw data sources,

2. the processed/transformed data,

3. and the feature sets.

For each dataset, an individual report is provided describing:

· The schema of the raw data.

· The meaning of each data field.

· Any other information that is helpful for understanding the data.

· If the dataset is the output of processing/transforming/feature engineering existing dataset(s), the names of the input datasets and the links to the scripts used to conduct the operation.

Raw Data Sources

Provide details about the Raw Data Sources shared by the client.

Details About Raw Data Sources

Dataset X Parameters

For each dataset provide the following information (a minimal logging sketch follows this list):

  • Size/volume of the data.
  • How the data is accessed, e.g., via an API or from SQL dumps, Excel, or CSV files. Any login credentials utilized must also be logged here.
  • Metadata details.
  • Data schema, whether newly developed or pre-existing.
  • Use a separate heading for each of the above.
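A minimal pandas sketch of how these parameters might be logged; the file name dataset_x.csv is a hypothetical placeholder:

```python
import pandas as pd

# Hypothetical file name; substitute the actual raw data source
df = pd.read_csv("dataset_x.csv")

# Size/volume of the data
print("Rows x columns:", df.shape)
print("Memory usage (MB):", round(df.memory_usage(deep=True).sum() / 1e6, 2))

# Data schema: column names and inferred types
print(df.dtypes)
```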

Processed Data

Provide details about the output of the processing exercise: essentially, all relevant details about how the data was transformed from its raw format into the shape required by the data-mining task.

Processed Data 1 summary: provide a summary of the rationale behind each data-processing step performed.

Feature Sets

This section is meant to provide details of any feature engineering performed on the raw dataset (an encoding sketch follows this list):

· The meaning of each feature generated as part of this exercise.

· Any dummy variables created.

· Techniques used to deal with categorical data.
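A minimal sketch of one common technique for categorical data, one-hot encoding via pandas dummy variables; the example frame is hypothetical:

```python
import pandas as pd

# Hypothetical example frame with one categorical column
df = pd.DataFrame({"gender": ["Female", "Male", "Unknown", "Female"]})

# Dummy variables via one-hot encoding
dummies = pd.get_dummies(df["gender"], prefix="gender")
df = pd.concat([df.drop(columns="gender"), dummies], axis=1)
print(df.columns.tolist())  # gender_Female, gender_Male, gender_Unknown
```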

General summary of the data

This should include at least the following (a pandas sketch follows the list of tools below):

· Get the data shape.

· Get attribute descriptions and variable types.

· Get a summary for each attribute/feature, such as the 5-point summary, IQR, etc.

· Get OLS (ordinary least squares) results for each feature.

· Generate univariate and bivariate plots for each feature, including histograms, scatter plots, and pairwise plots.

· Generate the correlation and variance-covariance matrices across the variables.

· In general, all major outputs from any of the following tools can be inserted here:

· Microsoft IDEAR

· RATTLE, Recommender

· pandas-profiling
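A minimal sketch of the summary steps with pandas and seaborn, assuming a hypothetical dataset_x.csv:

```python
import pandas as pd
import seaborn as sns

df = pd.read_csv("dataset_x.csv")  # hypothetical file name

print(df.shape)          # data shape
print(df.dtypes)         # attribute/variable types
print(df.describe())     # per-feature summary (count, mean, quartiles, min/max)

numeric = df.select_dtypes("number")
print(numeric.quantile(0.75) - numeric.quantile(0.25))  # IQR per feature

print(numeric.corr())    # correlation matrix
print(numeric.cov())     # variance-covariance matrix

sns.pairplot(numeric)    # univariate histograms plus bivariate scatter plots
```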

Data Quality Issues

· Determine missing records for each feature.

· Determine records with errors, e.g., #DIV/0!, NaN, etc.

· Determine outliers in the data.

· Determine records where a value's data type is unexpected, i.e., different from the majority of values.

· Determine records where an unexpected value is found.

A pandas sketch of the first three checks follows.
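A minimal sketch, assuming a hypothetical dataset_x.csv; the 1.5 × IQR rule used here is one common outlier heuristic, not the only choice:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("dataset_x.csv")  # hypothetical file name
numeric = df.select_dtypes("number")

# Missing records per feature
print(df.isna().sum())

# Records containing NaN or infinite values in any numeric column
bad = numeric.isna().any(axis=1) | np.isinf(numeric).any(axis=1)
print(df[bad])

# Outliers via the 1.5 * IQR rule
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).any(axis=1)
print(df[outliers])
```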

Data Transformation Steps

List details related to:

· Standardization performed on the raw data.

· Feature scaling performed and the type of scaling used.

· Data normalization.

· Data discretization performed.

· Any feature-reduction steps taken, such as PCA; use multicollinearity and correlation metrics to guide this. Similarly, columns with only a single value can be removed. (A sketch of these steps follows.)
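A minimal scikit-learn sketch of the scaling and reduction steps, assuming a hypothetical dataset_x.csv:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical file name; numeric columns only, NaN rows dropped for brevity
df = pd.read_csv("dataset_x.csv").select_dtypes("number").dropna()

# Remove columns with only a single value
df = df.loc[:, df.nunique() > 1]

# Standardization: zero mean, unit variance
standardized = StandardScaler().fit_transform(df)

# Normalization: min-max scaling to [0, 1]
normalized = MinMaxScaler().fit_transform(df)

# Feature reduction with PCA, keeping 95% of the explained variance
reduced = PCA(n_components=0.95).fit_transform(standardized)
print(df.shape[1], "->", reduced.shape[1], "features")
```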

Data quality summary

Report the following statistics related to the source data, organized along the dimensions of data quality below.

Dimensions of data quality

Accuracy

Also known as validity: the degree to which the measures conform to defined business rules or constraints, i.e., the data measure what they are intended to measure. The following data constraints should be considered (a validation sketch follows the list).

o Data-type constraints: values in a particular column must be of a particular data type, e.g., Boolean, numeric (integer or real), date, etc.

o Range constraints: typically, numbers or dates should fall within a certain range; that is, they have minimum and/or maximum permissible values.

o Mandatory constraints: certain columns cannot be empty.

o Unique constraints: a field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.

o Set-membership constraints: the values for a column come from a set of discrete values or codes. For example, a person’s gender may be Female, Male, or Unknown (not recorded).

o Regular-expression patterns: occasionally, text fields have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.

o Cross-field validation: certain conditions that involve multiple fields must hold. For example, in laboratory medicine, the components of the differential white blood cell count must sum to 100 (since they are all percentages); in a hospital database, a patient’s date of discharge cannot be earlier than the date of admission.
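A minimal pandas sketch of several of these constraint checks; the frame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical patient records
df = pd.DataFrame({
    "ssn": ["111-22-3333", "111-22-3333", "444-55-6666"],
    "age": [34, -2, 67],
    "gender": ["Female", "Male", "Other"],
    "phone": ["(042) 555-1234", "5551234", "(021) 555-9876"],
    "admitted": pd.to_datetime(["2021-01-05", "2021-01-10", "2021-01-08"]),
    "discharged": pd.to_datetime(["2021-01-09", "2021-01-07", "2021-01-12"]),
})

# Range constraint: age must be between 0 and 120
print(df[~df["age"].between(0, 120)])

# Unique constraint: no two persons share a social security number
print(df[df["ssn"].duplicated(keep=False)])

# Set-membership constraint: gender must come from a fixed code list
print(df[~df["gender"].isin(["Female", "Male", "Unknown"])])

# Regular-expression pattern: phone must match (999) 999-9999
print(df[~df["phone"].str.match(r"\(\d{3}\) \d{3}-\d{4}$")])

# Cross-field validation: discharge cannot precede admission
print(df[df["discharged"] < df["admitted"]])
```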

Comments: Add your comments and qualifying remarks about this quality dimension.

Reliability

Check and confirm that the data generated by a program’s information system are based on protocols and procedures that do not change according to who uses them, or when or how often they are used. The data are reliable when they are measured and collected consistently.

Comments: Add your comments and qualifying remarks about this quality dimension.

Precision

This means that the source data has sufficient detail. For example, an indicator requires the number of individuals who received HIV counseling & testing and received their test results by sex of the individual. An information system lacks precision if it is not designed to record the sex of the individual who received counseling and testing.

Comments: Add your comments and qualifying remarks about this quality dimension.

Completeness

Completeness means that the information system from which the results are derived is appropriately inclusive. It represents the degree to which all required measures are known.

Comments: Add your comments and qualifying remarks about this quality dimension.

Timeliness

Data are timely when they are up-to-date (current), and when the information is available on time. Timeliness is affected by: (1) the rate at which the program’s information system is updated; (2) the rate of change of actual program activities; and (3) when the information is actually used or required.

Comments (may or may not be applicable): Add your comments and qualifying remarks about this quality dimension.

Integrity

Integrity issues arise when the systems used to generate the data are not protected from deliberate bias or manipulation for political or personal reasons.

Comments: Add your comments and qualifying remarks about this quality dimension.

Confidentiality

Confidentiality means that clients are assured that their data will be maintained according to national and/or international data standards. This means that personal data are not disclosed inappropriately, and that data in hard copy and electronic form are treated with appropriate levels of security (e.g., kept in locked cabinets and in password-protected files).

Comments: Add your comments and qualifying remarks about this quality dimension.

Uniformity

This dimension specifies whether all methods of data collection, including units of measure, are uniform across all systems. This implies that test values must be normalized to a common unit across all tests.

Comments: Add your comments and qualifying remarks about this quality dimension.

Consistency

Consistency issues occur when a single attribute is recorded in different ways, e.g., a patient name recorded differently against the same MR number due to spelling mistakes or other variations.

Comments: Add your comments and qualifying remarks about this quality dimension.

Data Management Processes and Data Quality Controls

· Does clear documentation of collection, aggregation and manipulation steps exist?

· Are data quality challenges identified and are mechanisms in place for addressing them?

· Are there clearly defined and followed procedures to identify and reconcile discrepancies in reports?

· Are there clearly defined and followed procedures to periodically verify source data?

Comments: Add your comments and qualifying remarks about the data management processes and controls.

Identification of Target & Independent Variables

· Identify all probable target columns.

· Visualize all classes manifested in each target variable.

· Identify all independent variables.

· Briefly explain the meaning and significance of each variable.

· Discuss and visualize the relationship between the target and independent variables. This can be done using bivariate scatter-matrix plots, the variance-covariance matrix, or a correlation-matrix plot (a sketch follows).
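A minimal sketch with pandas and seaborn; the file name and the "target" column are hypothetical, and the target is assumed to be numeric or label-encoded for the correlation step:

```python
import pandas as pd
import seaborn as sns

df = pd.read_csv("dataset_x.csv")  # hypothetical file name

# Classes manifested in the target variable
print(df["target"].value_counts())

# Bivariate scatter-matrix plot, colored by target class
sns.pairplot(df, hue="target")

# Correlation of each numeric independent variable with the target
print(df.select_dtypes("number").corr()["target"].sort_values(ascending=False))
```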

Variable ranking

Perform variable-importance ranking for all independent variables.

This can be achieved using the correlation between a feature and the defined target variable, or using PMI (pointwise mutual information). Other methods include RFE (recursive feature elimination), random-forest feature importance, etc. A sketch of the latter two follows.
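A minimal scikit-learn sketch of two of these methods; the file name and "target" column are hypothetical, and a classification task is assumed:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

df = pd.read_csv("dataset_x.csv").dropna()  # hypothetical file name
X = df.select_dtypes("number").drop(columns="target")
y = df["target"]

# Random-forest feature importances
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = pd.Series(forest.feature_importances_, index=X.columns)
print(ranking.sort_values(ascending=False))

# Recursive feature elimination down to the five strongest features
rfe = RFE(forest, n_features_to_select=5).fit(X, y)
print(list(X.columns[rfe.support_]))
```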

Insights learned

This section is meant to summarize key findings related to various aspects of the workflow.

Project Execution

What did we learn about the customer-engagement process as far as data collection and access are concerned? How could this be improved?

Data science / Engineering

Learnings related to data science/engineering, tips/tricks, etc.

Domain

Any particular insights and observations related to the specific domain of healthcare that may be helpful in pattern finding in the future.

Product

What different products or services were utilized to deal with the data? Note the key findings here.

Any specific challenges

Specific issues, unique setups, or challenges that had to be addressed during the engagement, and how they were resolved.

Next Steps

This should include a brief plan of the activities that follow those outlined in this documentation, including milestones, resources, etc.
