Hone logo
Hone
Problems

Calculating Percentiles with SQL Window Functions

This challenge focuses on calculating percentiles for a given dataset within specific partitions using SQL window functions. Percentiles are crucial for understanding data distribution, identifying outliers, and comparing values against a baseline. Mastering this skill will enable you to perform sophisticated data analysis directly within your database.

Problem Description

You are tasked with calculating the 50th percentile (median) and the 90th percentile for the score column, partitioned by the category column. The results should be returned alongside the original data.

Key Requirements:

  • For each row in the input table, calculate the 50th percentile (median) of score within its category.
  • For each row, calculate the 90th percentile of score within its category.
  • The output should include the original columns (id, category, score) along with the calculated 50th and 90th percentiles.

Expected Behavior:

The percentile calculation should consider all rows within the same category as the current row. The ordering for percentile calculation is based on the score column.

Edge Cases:

  • Empty Categories: If a category has no associated records, no percentile should be calculated for that category (though this scenario is unlikely with the provided constraints).
  • Single-Record Categories: If a category has only one record, the percentile should be the score of that single record.
  • Duplicate Scores: The percentile calculation should handle duplicate scores correctly according to standard percentile definitions.

Examples

Example 1:

Input Table: DataScores

idcategoryscore
1A10
2A20
3A30
4B50
5B60
6C100

Output Table:

idcategoryscorep50_scorep90_score
1A102030
2A202030
3A302030
4B505560
5B605560
6C100100100

Explanation:

  • Category A: Scores are 10, 20, 30.
    • The median (50th percentile) is 20.
    • The 90th percentile, when ordered (10, 20, 30) and considering 3 elements, is the third element, 30.
  • Category B: Scores are 50, 60.
    • The median (50th percentile) is the average of the two middle values, (50+60)/2 = 55.
    • The 90th percentile, when ordered (50, 60) and considering 2 elements, is the second element, 60.
  • Category C: Score is 100.
    • The median (50th percentile) and 90th percentile are both 100.

Example 2:

Input Table: DataScores

idcategoryscore
10X10
11X10
12X20
13Y70
14Y80
15Y80
16Y90

Output Table:

idcategoryscorep50_scorep90_score
10X101020
11X101020
12X201020
13Y708090
14Y808090
15Y808090
16Y908090

Explanation:

  • Category X: Scores are 10, 10, 20.
    • The median (50th percentile) is 10.
    • The 90th percentile is 20.
  • Category Y: Scores are 70, 80, 80, 90.
    • The median (50th percentile) is the average of the two middle values: (80+80)/2 = 80.
    • The 90th percentile is 90.

Constraints

  • The DataScores table will contain between 1 and 1000 rows.
  • The id column is unique and an integer.
  • The category column is a string and will contain between 1 and 10 distinct categories.
  • The score column is an integer and will range from 0 to 1000.
  • The SQL dialect should support window functions, specifically PERCENTILE_CONT or equivalent functions like PERCENTILE_DISC or custom implementations if necessary.
  • The solution should be efficient enough to run within 5 seconds on the given constraints.

Notes

  • You will need to use window functions in your SQL query.
  • The PERCENTILE_CONT(fraction) WITHIN GROUP (ORDER BY expression) function is a common way to calculate percentiles. fraction represents the percentile (e.g., 0.50 for the 50th percentile, 0.90 for the 90th percentile).
  • The PARTITION BY clause within the window function is essential for calculating percentiles independently for each category.
  • Consider how your chosen SQL dialect handles interpolation for percentiles when the rank falls between two values. PERCENTILE_CONT typically interpolates.
Loading editor...
plaintext