SQL Data Quality Validation
This challenge focuses on implementing essential data quality checks in SQL. Ensuring data accuracy, completeness, and consistency is a cornerstone of reliable data analysis and system functionality. You will write SQL queries that identify common data quality issues.
Problem Description
You are given two tables: customers and orders. Your goal is to write SQL queries that identify and report on potential data quality problems within these tables. This involves checking for missing values, duplicate entries, and inconsistent data formats.
Key Requirements:
- Identify missing critical information: Detect records where essential fields are NULL.
- Detect duplicate records: Find rows that are identical across specified columns.
- Validate data formats/ranges: Check if data conforms to expected patterns or falls within acceptable ranges.
- Report findings: Each check should produce a clear output indicating the nature of the problem and the affected records.
Expected Behavior:
Your SQL queries should return specific sets of records or summary counts that highlight the data quality issues identified.
Edge Cases to Consider:
- Empty tables.
- Tables with no data quality issues.
- Different data types used for the same logical field (not explicitly tested in the basic examples, but worth keeping in mind).
Examples
Example 1: Missing Email Addresses
Suppose every customer in the customers table must have an email address for marketing communications.
Input Tables:
customers
| customer_id | first_name | last_name | email |
|---|---|---|---|
| 1 | Alice | Smith | alice@example.com |
| 2 | Bob | Johnson | NULL |
| 3 | Charlie | Brown | charlie@test.com |
| 4 | David | Williams | NULL |
Output Query Result:
| customer_id | first_name | last_name | email |
|---|---|---|---|
| 2 | Bob | Johnson | NULL |
| 4 | David | Williams | NULL |
Explanation: The query identifies customers whose email field is NULL, indicating missing contact information.
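One way to express this check, sketched in standard SQL against the schema described in the Constraints section:

```sql
-- Customers whose email is missing (NULL).
-- Add OR TRIM(email) = '' if empty strings should also be treated as missing.
SELECT customer_id, first_name, last_name, email
FROM customers
WHERE email IS NULL;
```

Note that a comparison such as email = NULL never matches; NULL values must be tested with IS NULL.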
Example 2: Duplicate Customer Entries
Suppose customer records are considered duplicates if they share the same first_name, last_name, and email.
Input Tables:
customers
| customer_id | first_name | last_name | email |
|---|---|---|---|
| 1 | Alice | Smith | alice@example.com |
| 2 | Bob | Johnson | bob@example.com |
| 3 | Alice | Smith | alice@example.com |
| 4 | Charlie | Brown | charlie@test.com |
| 5 | Bob | Johnson | bob@example.com |
Output Query Result:
| first_name | last_name | email | count |
|---|---|---|---|
| Alice | Smith | alice@example.com | 2 |
| Bob | Johnson | bob@example.com | 2 |
Explanation: This query groups customers by their first name, last name, and email, and reports those combinations that appear more than once, along with their frequency.
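A sketch of this duplicate check using GROUP BY and HAVING (the alias count may need quoting in some dialects, since it can be a reserved word):

```sql
-- Combinations of first_name, last_name, and email that occur more than once.
SELECT first_name, last_name, email, COUNT(*) AS count
FROM customers
GROUP BY first_name, last_name, email
HAVING COUNT(*) > 1;
```

In standard SQL, GROUP BY places all NULL emails into a single group, so two records that both lack an email would also be reported as duplicates; decide whether that is the behavior you want.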
Example 3: Invalid Order Dates
Orders should ideally have an order_date that is not in the future.
Input Tables:
orders
| order_id | customer_id | order_date | amount |
|---|---|---|---|
| 101 | 1 | 2023-10-26 | 50.00 |
| 102 | 3 | 2023-10-27 | 75.50 |
| 103 | 1 | 2024-01-15 | 25.00 |
| 104 | 2 | 2023-10-28 | 100.00 |
| 105 | 4 | 2024-02-01 | 60.00 |
Output Query Result:
| order_id | customer_id | order_date | amount |
|---|---|---|---|
| 103 | 1 | 2024-01-15 | 25.00 |
| 105 | 4 | 2024-02-01 | 60.00 |
Explanation: The query selects orders where the order_date is after the current date (assuming 'current date' is Oct 29, 2023, for demonstration purposes), indicating potential data entry errors.
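A sketch of this range check; CURRENT_DATE is the standard SQL way to obtain today's date, and the fixed date from the example is shown as an alternative for reproducing the output above:

```sql
-- Orders dated in the future relative to the current date.
-- To reproduce the example output exactly, compare against DATE '2023-10-29' instead.
SELECT order_id, customer_id, order_date, amount
FROM orders
WHERE order_date > CURRENT_DATE;
```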
Constraints
- The customers table has columns: customer_id (INTEGER, PRIMARY KEY), first_name (VARCHAR), last_name (VARCHAR), email (VARCHAR).
- The orders table has columns: order_id (INTEGER, PRIMARY KEY), customer_id (INTEGER, FOREIGN KEY referencing customers), order_date (DATE), amount (DECIMAL). A DDL sketch for both tables follows this list.
- Assume the current date for date validation checks is a fixed point in time, or can be accessed via a standard SQL function (e.g., CURRENT_DATE).
- Performance is a secondary concern for this challenge; correctness and clarity of the data quality checks are paramount.
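For reference, one possible DDL sketch matching the constraints above (the VARCHAR lengths and DECIMAL precision are illustrative assumptions, not part of the problem statement):

```sql
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    first_name  VARCHAR(100),   -- length is an assumption
    last_name   VARCHAR(100),   -- length is an assumption
    email       VARCHAR(255)    -- length is an assumption
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),
    order_date  DATE,
    amount      DECIMAL(10, 2)  -- precision/scale are assumptions
);
```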
Notes
- You'll need to write multiple SQL queries, one for each distinct data quality check required.
- For duplicate detection, consider which columns uniquely identify a record for the purpose of duplication.
- Think about how to handle NULLs appropriately in your checks (see the sketch after these notes).
- The specific SQL dialect is flexible, but standard SQL syntax is preferred.
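As one illustration of NULL behavior, aggregate functions such as COUNT(column) skip NULLs, which can be used to build a per-column completeness summary; a minimal sketch against the customers table:

```sql
-- Per-column completeness summary: COUNT(column) ignores NULLs,
-- so the difference from COUNT(*) is the number of missing values.
SELECT
    COUNT(*)                     AS total_rows,
    COUNT(*) - COUNT(first_name) AS missing_first_name,
    COUNT(*) - COUNT(last_name)  AS missing_last_name,
    COUNT(*) - COUNT(email)      AS missing_email
FROM customers;
```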