Calculating Percentiles with SQL Window Functions
This challenge focuses on calculating percentiles for a given dataset within specific partitions using SQL window functions. Percentiles are crucial for understanding data distribution, identifying outliers, and comparing values against a baseline. Mastering this skill will enable you to perform sophisticated data analysis directly within your database.
Problem Description
You are tasked with calculating the 50th percentile (median) and the 90th percentile for the score column, partitioned by the category column. The results should be returned alongside the original data.
Key Requirements:
- For each row in the input table, calculate the 50th percentile (median) of
scorewithin itscategory. - For each row, calculate the 90th percentile of
scorewithin itscategory. - The output should include the original columns (
id,category,score) along with the calculated 50th and 90th percentiles.
Expected Behavior:
The percentile calculation should consider all rows within the same category as the current row. The ordering for percentile calculation is based on the score column.
Edge Cases:
- Empty Categories: If a category has no associated records, no percentile should be calculated for that category (though this scenario is unlikely with the provided constraints).
- Single-Record Categories: If a category has only one record, the percentile should be the score of that single record.
- Duplicate Scores: The percentile calculation should handle duplicate scores correctly according to standard percentile definitions.
Examples
Example 1:
Input Table: DataScores
| id | category | score |
|---|---|---|
| 1 | A | 10 |
| 2 | A | 20 |
| 3 | A | 30 |
| 4 | B | 50 |
| 5 | B | 60 |
| 6 | C | 100 |
Output Table:
| id | category | score | p50_score | p90_score |
|---|---|---|---|---|
| 1 | A | 10 | 20 | 30 |
| 2 | A | 20 | 20 | 30 |
| 3 | A | 30 | 20 | 30 |
| 4 | B | 50 | 55 | 60 |
| 5 | B | 60 | 55 | 60 |
| 6 | C | 100 | 100 | 100 |
Explanation:
- Category A: Scores are 10, 20, 30.
- The median (50th percentile) is 20.
- The 90th percentile, when ordered (10, 20, 30) and considering 3 elements, is the third element, 30.
- Category B: Scores are 50, 60.
- The median (50th percentile) is the average of the two middle values, (50+60)/2 = 55.
- The 90th percentile, when ordered (50, 60) and considering 2 elements, is the second element, 60.
- Category C: Score is 100.
- The median (50th percentile) and 90th percentile are both 100.
Example 2:
Input Table: DataScores
| id | category | score |
|---|---|---|
| 10 | X | 10 |
| 11 | X | 10 |
| 12 | X | 20 |
| 13 | Y | 70 |
| 14 | Y | 80 |
| 15 | Y | 80 |
| 16 | Y | 90 |
Output Table:
| id | category | score | p50_score | p90_score |
|---|---|---|---|---|
| 10 | X | 10 | 10 | 20 |
| 11 | X | 10 | 10 | 20 |
| 12 | X | 20 | 10 | 20 |
| 13 | Y | 70 | 80 | 90 |
| 14 | Y | 80 | 80 | 90 |
| 15 | Y | 80 | 80 | 90 |
| 16 | Y | 90 | 80 | 90 |
Explanation:
- Category X: Scores are 10, 10, 20.
- The median (50th percentile) is 10.
- The 90th percentile is 20.
- Category Y: Scores are 70, 80, 80, 90.
- The median (50th percentile) is the average of the two middle values: (80+80)/2 = 80.
- The 90th percentile is 90.
Constraints
- The
DataScorestable will contain between 1 and 1000 rows. - The
idcolumn is unique and an integer. - The
categorycolumn is a string and will contain between 1 and 10 distinct categories. - The
scorecolumn is an integer and will range from 0 to 1000. - The SQL dialect should support window functions, specifically
PERCENTILE_CONTor equivalent functions likePERCENTILE_DISCor custom implementations if necessary. - The solution should be efficient enough to run within 5 seconds on the given constraints.
Notes
- You will need to use window functions in your SQL query.
- The
PERCENTILE_CONT(fraction) WITHIN GROUP (ORDER BY expression)function is a common way to calculate percentiles.fractionrepresents the percentile (e.g., 0.50 for the 50th percentile, 0.90 for the 90th percentile). - The
PARTITION BYclause within the window function is essential for calculating percentiles independently for each category. - Consider how your chosen SQL dialect handles interpolation for percentiles when the rank falls between two values.
PERCENTILE_CONTtypically interpolates.