Data Quality Assurance with SQL
Data quality is paramount for reliable decision-making. This challenge requires you to implement a series of SQL queries to perform data quality checks on a given dataset, identifying potential issues like missing values, invalid data types, and inconsistencies. Successfully completing this challenge demonstrates your ability to ensure data integrity using SQL.
Problem Description
You are provided with a table named customer_data containing customer information. Your task is to write SQL queries to perform the following data quality checks and report the findings:
- Missing Values: Identify the number of rows with missing values in each column.
- Data Type Validation: Verify that the
order_datecolumn contains valid dates. Report the number of rows where theorder_dateis not a valid date. - Range Validation: The
order_totalcolumn should contain values between 0 and 1000 (inclusive). Report the number of rows whereorder_totalfalls outside this range. - Uniqueness Validation: Ensure that the
customer_idcolumn contains unique values. Report the number of duplicatecustomer_idvalues. - Consistency Check: Verify that the
countrycolumn only contains values from a predefined list: 'USA', 'Canada', 'UK', 'Germany', 'France'. Report the number of rows where thecountryvalue is not in this list.
The output for each check should be a single row with two columns: check_name and error_count.
Table Schema:
customer_data
| Column Name | Data Type |
|---|---|
customer_id | INT |
first_name | VARCHAR(255) |
last_name | VARCHAR(255) |
email | VARCHAR(255) |
order_date | VARCHAR(255) |
order_total | DECIMAL(10, 2) |
country | VARCHAR(255) |
Examples
Example 1:
Input:
customer_data table with some missing values, invalid dates in order_date, order_totals outside the range, duplicate customer_ids, and countries not in the allowed list.
Output:
| check_name | error_count |
|---|---|
| Missing Values | 15 |
| Data Type Validation | 5 |
| Range Validation | 3 |
| Uniqueness Validation | 2 |
| Consistency Check | 7 |
Explanation:
The output reflects the number of errors found for each data quality check. For instance, 15 rows have missing values across various columns, 5 rows have invalid dates in the `order_date` column, and so on.
Example 2:
Input:
customer_data table with all values valid and no missing data.
Output:
| check_name | error_count |
|---|---|
| Missing Values | 0 |
| Data Type Validation | 0 |
| Range Validation | 0 |
| Uniqueness Validation | 0 |
| Consistency Check | 0 |
Explanation:
Since all data is valid, the error count for each check is 0.
Example 3: (Edge Case - Empty Table)
Input:
customer_data table is empty.
Output:
| check_name | error_count |
|---|---|
| Missing Values | 0 |
| Data Type Validation | 0 |
| Range Validation | 0 |
| Uniqueness Validation | 0 |
| Consistency Check | 0 |
Explanation:
An empty table results in zero errors for all checks.
Constraints
- The
customer_datatable will contain between 0 and 1000 rows. - The
order_datecolumn will contain strings representing dates in various formats (e.g., 'YYYY-MM-DD', 'MM/DD/YYYY', invalid strings). - The
order_totalcolumn will contain decimal numbers. - The
countrycolumn will contain strings. - The SQL dialect should be compatible with standard SQL (e.g., PostgreSQL, MySQL, SQL Server). Assume a standard date format conversion function is available (e.g.,
TRY_CASTin SQL Server,TO_DATEin PostgreSQL/MySQL). - Performance is not a primary concern for this challenge, but avoid excessively inefficient queries.
Notes
- Consider using
COUNT(*)andCASEstatements to efficiently count errors. - For data type validation, use appropriate functions to attempt conversion to the expected data type and handle errors gracefully.
TRY_CASTor similar functions are recommended to avoid query failures due to invalid data. - For range validation, use
WHEREclauses with appropriate comparison operators. - For uniqueness validation, use
GROUP BYandHAVINGclauses to identify duplicate values. - For consistency checks, use
WHEREclauses withINoperators to filter for allowed values. - The
check_namecolumn in the output should be exactly as specified in the problem description. - The
error_countcolumn should be an integer representing the number of errors found for each check. - You are expected to provide a single query that combines all the checks into a single result set. Using multiple queries is acceptable, but a single query demonstrating efficient data quality assessment is preferred.
- Assume that the database system supports common SQL functions like
COUNT,CASE,GROUP BY,HAVING,IN, and date conversion functions.