The "Popularity Problem": An Application of SQL Analytic Ranking Functions (Part 1)
This is the first in a series of occasional articles on SQL features and coding techniques.
A situation that comes up frequently in my work is the selection of a "primary" or "dominant" attribute for an individual or group. Some examples are:
A Simple Case - Who won the Election?
Let's start with the simplest variant of the problem, the election. Here, there is a population of people, and a set of candidates. We assume there is one vote per person, and everyone's vote counts equally. A sample data set might look like this:
To find the winner, we have to count how many votes each candidate received, then sort on the total number of votes (highest to lowest). The person at the top of the list is the winner. In fact, if we don't care who came in second, third, or fourth, then we don't need to sort; we can just choose the candidate having the highest number of votes. This is a simple and obvious example, but it illustrates some concepts that we will need later.
In any version of SQL, we can write a one-statement query to list the candidates in order:
You might already see a problem. Who came in third? The SQL query we just ran will give a non-deterministic answer. Since we have no way to handle a tie, the query could put either Mr. Nader or Mr. Barr in third place, and there is no way to tell in advance what the result will be. In fact, you could run the exact same query twice, and get two different results.
So we need a tiebreak rule. In every problem of this type, a tiebreak rule is required in order to get consistent, reproducible results. Also, the tiebreak rule must be sufficient to completely disambiguate the candidates. In SQL terms, the tiebreak rule must be created using a set of columns that uniquely identify each row. This is also known as a candidate key.
In our simple election example, we can get by if we add a secondary sort, using the candidate's name. It is arbitrary whether to sort alphabetically ascending or descending (or using some other method, like the reverse of the string). The sorting method will of course affect the result, and care should be taken to choose a fair (or at least reasonable) method if possible. But a choice must be made.
Here is our finished query to give the election results:
How would we get back a single row, showing only the winner? The easiest way would be to use the "TOP 1" option with the above query. Another way would be to assign a sequential ranking number to each candidate, and then select the one whose rank=1. However if your version of SQL does not have the functionality to do either of those, but does allow subqueries, you could still get a solution using "brute force":
Although we got the answer using a single query, this seems like quite a lot of work! And we are scanning the "votes" table three times. It would make more sense to store some results in an intermediate table. This would also be a good place to use "Common Table Expressions", a feature introduced with Microsoft SQL Server 2005.
Magazine Subscription Problem
The preceding example has a single group. But what if we have many groups, and we want to find the most popular selection within each group? The magazine subscription problem is a good illustration of this situation:
We want to identify the single most popular magazine within each ZIP code. As before, we need a tiebreak rule, which will just be the magazine name in (ascending) alphabetical order. Here is the "brute force" method:
Again, an awful lot of work and duplicated code.
Here is an alternative method, using SQL Server 2000 and the "IGNORE_DUP_KEY" indexing option:
This requires an extra table, a couple of steps, and a Microsoft-specific feature, but for a large data set it will perform better than the "brute force" method. Notice the
Let's see how using Analytic Functions might help.
Analytic functions are also known as Ranking or Windowing functions. In Microsoft SQL Server 2005, they include the functions RANK(), DENSE_RANK(), NTILE(), and ROW_NUMBER(). Oracle has had analytical functions since Version 8i (1999). In addition to the four functions just listed, Oracle includes a much richer set of functions, including CUME_DIST(), FIRST_VALUE(), LAST_VALUE(), and PERCENT_RANK(). Unlike SQL Server, Oracle functions can also operate on a sliding window of rows, so that you can do things like a moving average within a single query.
For the magazine subscription problem, we can use the ROW_NUMBER() function. This function will return a unique, sequential, ascending number for each row in a result set. But it can also operate on groups of rows that are defined by a partition, so that each group has its own set of sequence numbers.
The method is to count the number of subscriptions, grouping by both ZIP_code and magazine name, and then use ROW_NUMBER() to assign a popularity ranking for each magazine within each ZIP code:
Again, notice the use of the magazine name to enforce the tiebreak rule - it is included in the
Now, to get the single most popular magazine for each ZIP, all we have to do is filter on mag_rank = 1:
This is even easier in Oracle, where the FIRST_VALUE() function does the filtering for you:
Be careful, you need the
In Part 2 of this article, we will look at a more complicated example that illustrates how to use a weighting factor and how to create multiple result columns in a single operation.
A situation that comes up frequently in my work is the selection of a "primary" or "dominant" attribute for an individual or group. Some examples are:
- Find the most popular magazine subscription in a ZIP code
- Identify a store where a customer spends the most money
- Who won the election?
A Simple Case - Who won the Election?
Let's start with the simplest variant of the problem, the election. Here, there is a population of people, and a set of candidates. We assume there is one vote per person, and everyone's vote counts equally. A sample data set might look like this:
| Voter | Candidate |
|---|---|
| John Q. Public | John McCain |
| Ennio Citizen | Barack Obama |
| Mary J. Student | Barack Obama |
| Alfred E. Oldman | Ralph Nader |
| Carter Perch | Bob Barr |
| Roger Clifton | Barack Obama |
| Eddie Thomas | John McCain |
In any version of SQL, we can write a one-statement query to list the candidates in order:
SELECT candidate
FROM votes
GROUP BY candidate
ORDER BY COUNT(*) DESC
You might already see a problem. Who came in third? The SQL query we just ran will give a non-deterministic answer. Since we have no way to handle a tie, the query could put either Mr. Nader or Mr. Barr in third place, and there is no way to tell in advance what the result will be. In fact, you could run the exact same query twice, and get two different results.
So we need a tiebreak rule. In every problem of this type, a tiebreak rule is required in order to get consistent, reproducible results. Also, the tiebreak rule must be sufficient to completely disambiguate the candidates. In SQL terms, the tiebreak rule must be created using a set of columns that uniquely identify each row. This is also known as a candidate key.
In our simple election example, we can get by if we add a secondary sort, using the candidate's name. It is arbitrary whether to sort alphabetically ascending or descending (or using some other method, like the reverse of the string). The sorting method will of course affect the result, and care should be taken to choose a fair (or at least reasonable) method if possible. But a choice must be made.
Here is our finished query to give the election results:
SELECT candidate
FROM votes
GROUP BY candidate
ORDER BY COUNT(*) DESC
, candidate DESC
How would we get back a single row, showing only the winner? The easiest way would be to use the "TOP 1" option with the above query. Another way would be to assign a sequential ranking number to each candidate, and then select the one whose rank=1. However if your version of SQL does not have the functionality to do either of those, but does allow subqueries, you could still get a solution using "brute force":
SELECT A.candidate
FROM (SELECT candidate
, COUNT (*) AS num_votes
FROM votes
GROUP BY candidate
) A
, (SELECT num_votes
, MAX (candidate) AS last_candidate_name
FROM (SELECT candidate
, COUNT (*) AS num_votes
FROM votes
GROUP BY candidate
) BB
GROUP BY num_votes
) B
, (SELECT MAX(num_votes) AS max_votes
FROM (SELECT candidate
, COUNT (*) AS num_votes
FROM votes
GROUP BY candidate
) CC
) C
WHERE A.candidate = B.last_candidate_name
AND A.num_votes = B.num_votes
AND A.num_votes = C.max_votes
Although we got the answer using a single query, this seems like quite a lot of work! And we are scanning the "votes" table three times. It would make more sense to store some results in an intermediate table. This would also be a good place to use "Common Table Expressions", a feature introduced with Microsoft SQL Server 2005.
Magazine Subscription Problem
The preceding example has a single group. But what if we have many groups, and we want to find the most popular selection within each group? The magazine subscription problem is a good illustration of this situation:
| Subscriber | ZIP code | Magazine |
|---|---|---|
| John Q. Public | 20001 | Life |
| John Q. Public | 20001 | Washingtonian |
| Ennio Citizen | 20001 | New Yorker |
| Ennio Citizen | 20001 | Washingtonian |
| Mary J. Student | 90210 | Atlantic |
| Mary J. Student | 90210 | Chronicle |
| Mary J. Student | 90210 | Life |
| Alfred E. Oldman | 10016 | New Yorker |
| Alfred E. Oldman | 10016 | Time |
| Carter Perch | 90210 | New Yorker |
| Carter Perch | 90210 | Time |
| Carter Perch | 90210 | Mad |
| Carter Perch | 90210 | Atlantic |
| Roger Clifton | 20001 | Vogue |
| Roger Clifton | 20001 | Elle |
| Eddie Thomas | 10016 | New Yorker |
| Eddie Thomas | 10016 | Atlantic |
| Eddie Thomas | 10016 | Time |
We want to identify the single most popular magazine within each ZIP code. As before, we need a tiebreak rule, which will just be the magazine name in (ascending) alphabetical order. Here is the "brute force" method:
SELECT A.ZIP_code
, A.magazine
FROM (SELECT ZIP_code
, magazine
, COUNT (*) AS num_subs
FROM magazine_subscriptions
GROUP BY ZIP_code
, magazine
) A
, (SELECT ZIP_code
, num_subs
, MIN (magazine) AS first_magazine_name
FROM (SELECT ZIP_code
, magazine
, COUNT (*) AS num_subs
FROM magazine_subscriptions
GROUP BY ZIP_code
, magazine
) BB
GROUP BY ZIP_code
, num_subs
) B
, (SELECT ZIP_code
, MAX(num_subs) AS max_subs
FROM (SELECT ZIP_code
, magazine
, COUNT (*) AS num_subs
FROM magazine_subscriptions
GROUP BY ZIP_code
, magazine
) CC
GROUP BY ZIP_code
) C
WHERE A.ZIP_code = B.ZIP_code
AND A.magazine = B.first_magazine_name
AND A.num_subs = B.num_subs
AND A.ZIP_code = C.ZIP_code
AND A.num_subs = C.max_subs
Again, an awful lot of work and duplicated code.
Here is an alternative method, using SQL Server 2000 and the "IGNORE_DUP_KEY" indexing option:
CREATE TABLE most_popular_magazine_by_ZIP_code
( ZIP_code VARCHAR (10)
, magazine VARCHAR (30)
)
CREATE UNIQUE INDEX most_popular_magazine_by_ZIP_code_IDX
ON most_popular_magazine_by_ZIP_code
( ZIP_code
)
WITH IGNORE_DUP_KEY
INSERT INTO most_popular_magazine_by_ZIP_code
SELECT ZIP_code
, magazine
FROM magazine_subscriptions
GROUP BY ZIP_code
, magazine
ORDER BY ZIP_code
, COUNT (*) DESC
, magazine
SELECT * FROM most_popular_magazine_by_ZIP_code
This requires an extra table, a couple of steps, and a Microsoft-specific feature, but for a large data set it will perform better than the "brute force" method. Notice the
ORDER BY clause for the INSERT statement - it includes the magazine name as the final sort key, to enforce the tiebreak rule. This is very important because otherwise you will have a non-deterministic result.Let's see how using Analytic Functions might help.
Analytic functions are also known as Ranking or Windowing functions. In Microsoft SQL Server 2005, they include the functions RANK(), DENSE_RANK(), NTILE(), and ROW_NUMBER(). Oracle has had analytical functions since Version 8i (1999). In addition to the four functions just listed, Oracle includes a much richer set of functions, including CUME_DIST(), FIRST_VALUE(), LAST_VALUE(), and PERCENT_RANK(). Unlike SQL Server, Oracle functions can also operate on a sliding window of rows, so that you can do things like a moving average within a single query.
For the magazine subscription problem, we can use the ROW_NUMBER() function. This function will return a unique, sequential, ascending number for each row in a result set. But it can also operate on groups of rows that are defined by a partition, so that each group has its own set of sequence numbers.
The method is to count the number of subscriptions, grouping by both ZIP_code and magazine name, and then use ROW_NUMBER() to assign a popularity ranking for each magazine within each ZIP code:
SELECT ZIP_code
, magazine
, num_subs
, row_number()
OVER (PARTITION BY zip_code
ORDER BY num_subs DESC
, magazine
) AS mag_rank
FROM (SELECT ZIP_code
, magazine
, COUNT (*) AS num_subs
FROM magazine_subscriptions
GROUP BY ZIP_code
, magazine
) A
Again, notice the use of the magazine name to enforce the tiebreak rule - it is included in the
ORDER BY clause within the PARTITION specification. Here is the result:| ZIP_code | Magazine | num_subs | mag_rank |
| 10016 | New Yorker | 2 | 1 |
| 10016 | Time | 2 | 2 |
| 10016 | Atlantic | 1 | 3 |
| 20001 | Washingtonian | 2 | 1 |
| 20001 | Elle | 1 | 2 |
| 20001 | Life | 1 | 3 |
| 20001 | New Yorker | 1 | 4 |
| 20001 | Vogue | 1 | 5 |
| 90210 | Atlantic | 2 | 1 |
| 90210 | Chronicle | 1 | 2 |
| 90210 | Life | 1 | 3 |
| 90210 | Mad | 1 | 4 |
| 90210 | New Yorker | 1 | 5 |
| 90210 | Time | 1 | 6 |
Now, to get the single most popular magazine for each ZIP, all we have to do is filter on mag_rank = 1:
SELECT ZIP_code
, magazine AS most_popular_magazine
FROM (SELECT ZIP_code
, magazine
, num_subs
, row_number()
OVER (PARTITION BY zip_code
ORDER BY num_subs DESC
, magazine
) AS mag_rank
FROM (SELECT ZIP_code
, magazine
, COUNT (*) AS num_subs
FROM magazine_subscriptions
GROUP BY ZIP_code
, magazine
) A
) B
WHERE mag_rank = 1
This is even easier in Oracle, where the FIRST_VALUE() function does the filtering for you:
SELECT DISTINCT ZIP_code
, first_value (magazine)
OVER (PARTITION BY ZIP_code
ORDER BY num_subs DESC
, magazine
) AS most_popular_magazine
FROM (SELECT ZIP_code
, magazine
, COUNT (*) AS num_subs
FROM magazine_subscriptions
GROUP BY ZIP_code
, magazine
) A
Be careful, you need the
DISTINCT qualifier here or you will get duplicate rows.In Part 2 of this article, we will look at a more complicated example that illustrates how to use a weighting factor and how to create multiple result columns in a single operation.
Labels: Analytic, Dominant, Election, non-deterministic, Popularity, Ranking, SQL, Windowing

