Hone logo
Hone
Problems

SQL Data Quality Validation

This challenge focuses on implementing essential data quality checks within SQL. Ensuring data accuracy, completeness, and consistency is a cornerstone of reliable data analysis and system functionality. You will be tasked with writing SQL queries to identify common data quality issues.

Problem Description

You are given two tables: customers and orders. Your goal is to write SQL queries that identify and report on potential data quality problems within these tables. This involves checking for missing values, duplicate entries, and inconsistent data formats.

Key Requirements:

  1. Identify missing critical information: Detect records where essential fields are NULL.
  2. Detect duplicate records: Find rows that are identical across specified columns.
  3. Validate data formats/ranges: Check if data conforms to expected patterns or falls within acceptable ranges.
  4. Report findings: Each check should produce a clear output indicating the nature of the problem and the affected records.

Expected Behavior:

Your SQL queries should return specific sets of records or summary counts that highlight the data quality issues identified.

Edge Cases to Consider:

  • Empty tables.
  • Tables with no data quality issues.
  • Different data types for the same logical field (though not explicitly tested in basic examples, good to keep in mind).

Examples

Example 1: Missing Email Addresses

Imagine the customers table needs every customer to have an email address for marketing communications.

Input Tables:

customers

customer_idfirst_namelast_nameemail
1AliceSmithalice@example.com
2BobJohnsonNULL
3CharlieBrowncharlie@test.com
4DavidWilliamsNULL

Output Query Result:

customer_idfirst_namelast_nameemail
2BobJohnsonNULL
4DavidWilliamsNULL

Explanation: The query identifies customers whose email field is NULL, indicating missing contact information.

Example 2: Duplicate Customer Entries

Suppose customer records are considered duplicates if they share the same first_name, last_name, and email.

Input Tables:

customers

customer_idfirst_namelast_nameemail
1AliceSmithalice@example.com
2BobJohnsonbob@example.com
3AliceSmithalice@example.com
4CharlieBrowncharlie@test.com
5BobJohnsonbob@example.com

Output Query Result:

first_namelast_nameemailcount
AliceSmithalice@example.com2
BobJohnsonbob@example.com2

Explanation: This query groups customers by their first name, last name, and email, and reports those combinations that appear more than once, along with their frequency.

Example 3: Invalid Order Dates

Orders should ideally have an order_date that is not in the future.

Input Tables:

orders

order_idcustomer_idorder_dateamount
10112023-10-2650.00
10232023-10-2775.50
10312024-01-1525.00
10422023-10-28100.00
10542024-02-0160.00

Output Query Result:

order_idcustomer_idorder_dateamount
10312024-01-1525.00
10542024-02-0160.00

Explanation: The query selects orders where the order_date is after the current date (assuming 'current date' is Oct 29, 2023, for demonstration purposes), indicating potential data entry errors.

Constraints

  • The customers table has columns: customer_id (INTEGER, PRIMARY KEY), first_name (VARCHAR), last_name (VARCHAR), email (VARCHAR).
  • The orders table has columns: order_id (INTEGER, PRIMARY KEY), customer_id (INTEGER, FOREIGN KEY referencing customers), order_date (DATE), amount (DECIMAL).
  • Assume the current date for date validation checks is a fixed point in time, or can be accessed via a standard SQL function (e.g., CURRENT_DATE).
  • Performance is a secondary concern for this challenge; correctness and clarity of the data quality checks are paramount.

Notes

  • You'll need to write multiple SQL queries, one for each distinct data quality check required.
  • For duplicate detection, consider which columns uniquely identify a record for the purpose of duplication.
  • Think about how to handle NULLs appropriately in your checks.
  • The specific SQL dialect is flexible, but standard SQL syntax is preferred.
Loading editor...
plaintext