MySQL CUME_DIST Function
The CUME_DIST() function in MySQL is a window function introduced in MySQL 8.0. It calculates the cumulative distribution of a value within a result set: the relative position of a row compared to the other rows, expressed as a fraction of the total number of rows.
The result is a value greater than 0 and at most 1. The smallest possible value is 1/N for a partition of N rows, so 0 itself never occurs.
Syntax
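The general shape of a call can be sketched as follows. This is a template rather than a runnable statement: partition_expression and sort_expression are placeholder names, and square brackets mark optional parts.

```sql
-- Template only: bracketed parts are optional, names are placeholders.
CUME_DIST() OVER (
    [PARTITION BY partition_expression, ...]
    ORDER BY sort_expression [ASC | DESC], ...
)
```

CUME_DIST() itself takes no arguments; everything that shapes the result goes inside the OVER clause.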
- PARTITION BY: Optional. Divides the result set into partitions. The function calculates the cumulative distribution within each partition; if the clause is omitted, the whole result set is treated as a single partition.
- ORDER BY: Specifies the order of rows within the partition. This order determines the rank of the rows.
How It Works
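CUME_DIST() calculates the proportion of rows in the partition that come before the current row in the sort order or are tied with it. With an ascending ORDER BY, that is the share of rows whose value is less than or equal to the current row's value; with a descending ORDER BY, it is the share whose value is greater than or equal to it.
- Formula: CUME_DIST = (number of rows preceding or tied with the current row) / (total number of rows in the partition)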
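One way to sanity-check this definition is to compute the same ratio by hand with COUNT(*) used as a window function. The query below is a minimal sketch, assuming the scores table from the Examples section and a descending sort; the column aliases are illustrative.

```sql
-- With ORDER BY and no explicit frame, the default window frame runs from
-- the start of the partition through the current row and its peers (ties),
-- so COUNT(*) returns "rows preceding or tied with the current row".
SELECT
    student_id,
    score,
    CUME_DIST() OVER (ORDER BY score DESC) AS cume_dist_builtin,
    COUNT(*) OVER (ORDER BY score DESC)
        / COUNT(*) OVER () AS cume_dist_manual
FROM scores;
```

The two computed columns should agree on every row, apart from how the values are formatted.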
Examples
Example 1: Basic Usage
Suppose you have a table named scores:
student_id | subject | score |
---|---|---|
1 | Math | 90 |
2 | Math | 80 |
3 | Math | 80 |
4 | Math | 70 |
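To follow along, a table like this can be created with the statements below. The column types are assumptions, since only the column names and sample values are given.

```sql
-- Column types are assumed; only names and sample values come from the table above.
CREATE TABLE scores (
    student_id INT PRIMARY KEY,
    subject    VARCHAR(50) NOT NULL,
    score      INT NOT NULL
);

INSERT INTO scores (student_id, subject, score) VALUES
    (1, 'Math', 90),
    (2, 'Math', 80),
    (3, 'Math', 80),
    (4, 'Math', 70);
```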
To calculate the cumulative distribution of scores:
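A query along these lines produces the result shown next. The descending sort is inferred from that result, where the highest score has the smallest value. The final ORDER BY is added only to make the row order predictable.

```sql
SELECT
    student_id,
    score,
    -- Rank from the highest score down, so the top score gets the smallest value.
    CUME_DIST() OVER (ORDER BY score DESC) AS cumulative_distribution
FROM scores
ORDER BY score DESC, student_id;
```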
Output:
student_id | score | cumulative_distribution |
---|---|---|
1 | 90 | 0.25 |
2 | 80 | 0.75 |
3 | 80 | 0.75 |
4 | 70 | 1.00 |
- For score = 90, the cumulative distribution is 1/4 = 0.25.
- For score = 80, which appears twice, the cumulative distribution counts both tied rows plus the row ranked above them: 3/4 = 0.75.
- For score = 70, it includes all rows: 4/4 = 1.00.
Example 2: Using PARTITION BY
If the scores table includes multiple subjects:
student_id | subject | score |
---|---|---|
1 | Math | 90 |
2 | Math | 80 |
3 | Math | 80 |
4 | Math | 70 |
5 | Science | 85 |
6 | Science | 80 |
To calculate the cumulative distribution of scores for each subject:
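A query of the following form matches the result shown next. As in Example 1, the descending sort is inferred from the result, and the final ORDER BY only fixes the display order.

```sql
SELECT
    student_id,
    subject,
    score,
    -- PARTITION BY restarts the calculation for each subject.
    CUME_DIST() OVER (
        PARTITION BY subject
        ORDER BY score DESC
    ) AS cumulative_distribution
FROM scores
ORDER BY subject, score DESC, student_id;
```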
Output:
student_id | subject | score | cumulative_distribution |
---|---|---|---|
1 | Math | 90 | 0.25 |
2 | Math | 80 | 0.75 |
3 | Math | 80 | 0.75 |
4 | Math | 70 | 1.00 |
5 | Science | 85 | 0.50 |
6 | Science | 80 | 1.00 |
- The cumulative distribution is calculated separately for each subject.
Example 3: Identifying Top Performers
To list students who are in the top 50% of scores:
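MySQL does not allow window functions in a WHERE clause, so the cumulative distribution has to be computed in a derived table (or a common table expression) and filtered in the outer query. The sketch below assumes the four-row table from Example 1; the alias ranked is illustrative.

```sql
SELECT student_id, score
FROM (
    -- Compute the cumulative distribution first...
    SELECT
        student_id,
        score,
        CUME_DIST() OVER (ORDER BY score DESC) AS cumulative_distribution
    FROM scores
) AS ranked
-- ...then filter on it in the outer query.
WHERE cumulative_distribution <= 0.5
ORDER BY score DESC;
```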
Output:
student_id | score |
---|---|
1 | 90 |
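- Only rows with a cumulative distribution of 0.5 or less are included. With the four-row table from Example 1, that is just the score of 90; the two tied scores of 80 share a value of 0.75 and fall outside the cutoff.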
Use Cases
Percentile Analysis:
- Analyze the rank of a row compared to the total rows in a dataset.
Top/Bottom Analysis:
- Identify rows within specific percentiles (e.g., top 10%, bottom 25%).
Cumulative Rankings:
- Evaluate the cumulative distribution for reporting or analytical purposes.
Conclusion
The CUME_DIST() function is a powerful tool for analyzing data distributions in MySQL. It is especially useful in percentile rankings and comparative analyses, helping users understand the relative position of rows within partitions or the entire result set. By using PARTITION BY and ORDER BY, you can customize its behavior for various scenarios.