2009/01/26

The "Popularity Problem": An Application of SQL Analytic Ranking Functions (Part 1)

This is the first in a series of occasional articles on SQL features and coding techniques.

A situation that comes up frequently in my work is the selection of a "primary" or "dominant" attribute for an individual or group. Some examples are:
  • Find the most popular magazine subscription in a ZIP code
  • Identify a store where a customer spends the most money
  • Who won the election?
In these examples, there is a one-to-many relationship that we want to convert into a one-to-one relationship. We start with a data set having a grain of one-to-many, and we want to reduce this by selecting the most popular choice. In this article, I'll discuss some variations on the problem, how to do it in SQL the "hard way", and how to use SQL Ranking functions to get to the solution easily.

A Simple Case - Who won the Election?

Let's start with the simplest variant of the problem, the election. Here, there is a population of people, and a set of candidates. We assume there is one vote per person, and everyone's vote counts equally. A sample data set might look like this:

Voter Candidate
John Q. Public John McCain
Ennio Citizen Barack Obama
Mary J. Student Barack Obama
Alfred E. Oldman Ralph Nader
Carter Perch Bob Barr
Roger Clifton Barack Obama
Eddie Thomas John McCain

To find the winner, we have to count how many votes each candidate received, then sort on the total number of votes (highest to lowest). The person at the top of the list is the winner. In fact, if we don't care who came in second, third, or fourth, then we don't need to sort; we can just choose the candidate having the highest number of votes. This is a simple and obvious example, but it illustrates some concepts that we will need later.

In any version of SQL, we can write a one-statement query to list the candidates in order:
SELECT candidate
  
FROM votes
  
GROUP BY candidate
  
ORDER BY COUNT(*) DESC


You might already see a problem. Who came in third? The SQL query we just ran will give a non-deterministic answer. Since we have no way to handle a tie, the query could put either Mr. Nader or Mr. Barr in third place, and there is no way to tell in advance what the result will be. In fact, you could run the exact same query twice, and get two different results.

So we need a tiebreak rule. In every problem of this type, a tiebreak rule is required in order to get consistent, reproducible results. Also, the tiebreak rule must be sufficient to completely disambiguate the candidates. In SQL terms, the tiebreak rule must be created using a set of columns that uniquely identify each row. This is also known as a candidate key.

In our simple election example, we can get by if we add a secondary sort, using the candidate's name. It is arbitrary whether to sort alphabetically ascending or descending (or using some other method, like the reverse of the string).
The sorting method will of course affect the result, and care should be taken to choose a fair (or at least reasonable) method if possible. But a choice must be made.

Here is our finished query to give the election results:
SELECT candidate
  
FROM votes
 
GROUP BY candidate
 
ORDER BY COUNT(*) DESC
     
candidate DESC



How would we get back a single row, showing only the winner? The easiest way would be to use the "TOP 1" option with the above query. Another way would be to assign a sequential ranking number to each candidate, and then select the one whose rank=1. However if your version of SQL does not have the functionality to do either of those, but does allow subqueries, you could still get a solution using "brute force":
SELECT A.candidate
  
FROM (SELECT candidate
             
COUNT (*) AS num_votes
          
FROM votes
         
GROUP BY candidate
        
A
     
, (SELECT num_votes
             
MAX (candidateAS last_candidate_name
          
FROM (SELECT candidate
                     
COUNT (*) AS num_votes
                  
FROM votes
                 
GROUP BY candidate
                
BB
         
GROUP BY num_votes
        
B
     
, (SELECT MAX(num_votesAS max_votes
          
FROM (SELECT candidate
                     
COUNT (*) AS num_votes
                  
FROM votes
                 
GROUP BY candidate
                
CC
        
C
  
WHERE A.candidate B.last_candidate_name
    
AND A.num_votes B.num_votes
    
AND A.num_votes C.max_votes




Although we got the answer using a single query, this seems like quite a lot of work! And we are scanning the "votes" table three times. It would make more sense to store some results in an intermediate table. This would also be a good place to use "Common Table Expressions", a feature introduced with Microsoft SQL Server 2005.

Magazine Subscription Problem

The preceding example has a single group. But what if we have many groups, and we want to find the most popular selection within each group? The magazine subscription problem is a good illustration of this situation:

Subscriber ZIP code Magazine
John Q. Public 20001 Life
John Q. Public 20001 Washingtonian
Ennio Citizen 20001 New Yorker
Ennio Citizen 20001 Washingtonian
Mary J. Student 90210 Atlantic
Mary J. Student 90210 Chronicle
Mary J. Student 90210 Life
Alfred E. Oldman10016New Yorker
Alfred E. Oldman10016Time
Carter Perch90210New Yorker
Carter Perch90210Time
Carter Perch90210Mad
Carter Perch90210Atlantic
Roger Clifton20001Vogue
Roger Clifton20001Elle
Eddie Thomas10016New Yorker
Eddie Thomas10016Atlantic
Eddie Thomas10016Time

We want to identify the single most popular magazine within each ZIP code. As before, we need a tiebreak rule, which will just be the magazine name in (ascending) alphabetical order. Here is the "brute force" method:
SELECT A.ZIP_code
     
A.magazine
  
FROM (SELECT ZIP_code
             
magazine
             
COUNT (*) AS num_subs
          
FROM magazine_subscriptions
         
GROUP BY ZIP_code
             
magazine
        
A
     
, (SELECT ZIP_code
             
num_subs
             
MIN (magazineAS first_magazine_name
          
FROM (SELECT ZIP_code
                     
magazine
                     
COUNT (*) AS num_subs
                  
FROM magazine_subscriptions
                 
GROUP BY ZIP_code
                     
magazine
                
BB
         
GROUP BY ZIP_code
             
num_subs
        
B
     
, (SELECT ZIP_code
             
MAX(num_subsAS max_subs
          
FROM (SELECT ZIP_code
                     
magazine
                     
COUNT (*) AS num_subs
                  
FROM magazine_subscriptions
                 
GROUP BY ZIP_code
                     
magazine
                
CC
         
GROUP BY ZIP_code
        
C
  
WHERE A.ZIP_code B.ZIP_code
    
AND A.magazine B.first_magazine_name
    
AND A.num_subs B.num_subs
    
AND A.ZIP_code C.ZIP_code
    
AND A.num_subs C.max_subs


Again, an awful lot of work and duplicated code.

Here is an alternative method, using SQL Server 2000 and the "IGNORE_DUP_KEY" indexing option:

CREATE TABLE most_popular_magazine_by_ZIP_code
ZIP_code VARCHAR (10)
magazine VARCHAR (30)
)

CREATE UNIQUE INDEX most_popular_magazine_by_ZIP_code_IDX
ON most_popular_magazine_by_ZIP_code
ZIP_code
)
WITH IGNORE_DUP_KEY

INSERT INTO most_popular_magazine_by_ZIP_code
  
SELECT ZIP_code
       
magazine
    
FROM magazine_subscriptions
   
GROUP BY ZIP_code
       
magazine
  
ORDER BY ZIP_code
      
COUNT (*) DESC
      
magazine

SELECT FROM most_popular_magazine_by_ZIP_code


This requires an extra table, a couple of steps, and a Microsoft-specific feature, but for a large data set it will perform better than the "brute force" method. Notice the ORDER BY clause for the INSERT statement - it includes the magazine name as the final sort key, to enforce the tiebreak rule. This is very important because otherwise you will have a non-deterministic result.

Let's see how using Analytic Functions might help.

Analytic functions are also known as Ranking or Windowing functions. In Microsoft SQL Server 2005, they include the functions RANK(), DENSE_RANK(), NTILE(), and ROW_NUMBER(). Oracle has had analytical functions since Version 8i (1999). In addition to the four functions just listed, Oracle includes a much richer set of functions, including CUME_DIST(), FIRST_VALUE(), LAST_VALUE(), and PERCENT_RANK(). Unlike SQL Server, Oracle functions can also operate on a sliding window of rows, so that you can do things like a moving average within a single query.

For the magazine subscription problem, we can use the ROW_NUMBER() function. This function will return a unique, sequential, ascending number for each row in a result set. But it can also operate on groups of rows that are defined by a partition, so that each group has its own set of sequence numbers.

The method is to count the number of subscriptions, grouping by both ZIP_code and magazine name, and then use ROW_NUMBER() to assign a popularity ranking for each magazine
within each ZIP code:
SELECT ZIP_code
     
magazine
     
num_subs
     
row_number()
       
OVER (PARTITION BY zip_code
       
ORDER BY num_subs DESC
           
magazine
             
)  AS mag_rank
  
FROM (SELECT ZIP_code
             
magazine
             
COUNT (*) AS num_subs
          
FROM magazine_subscriptions
         
GROUP BY ZIP_code
             
magazine
        
A


Again, notice the use of the magazine name to enforce the tiebreak rule - it is included in the ORDER BY clause within the PARTITION specification. Here is the result:

ZIP_code Magazine num_subs mag_rank
10016 New Yorker 2 1
10016 Time 2 2
10016 Atlantic 1 3
20001 Washingtonian 2 1
20001 Elle 1 2
20001 Life 1 3
20001 New Yorker 1 4
20001 Vogue 1 5
90210 Atlantic 2 1
90210 Chronicle 1 2
90210 Life 1 3
90210 Mad 1 4
90210 New Yorker 1 5
90210 Time 1 6

Now, to get the single most popular magazine for each ZIP, all we have to do is filter on mag_rank = 1:
SELECT ZIP_code
     
magazine AS most_popular_magazine
  
FROM (SELECT ZIP_code
             
magazine
             
num_subs
             
row_number()
               
OVER (PARTITION BY zip_code
               
ORDER BY num_subs DESC
                   
magazine
                     
)  AS mag_rank
          
FROM (SELECT ZIP_code
                     
magazine
                     
COUNT (*) AS num_subs
                  
FROM magazine_subscriptions
                 
GROUP BY ZIP_code
                     
magazine
                
A
        
B
  
WHERE mag_rank 1



This is even easier in Oracle, where the FIRST_VALUE() function does the filtering for you:
SELECT DISTINCT ZIP_code
     
first_value (magazine)
       
OVER (PARTITION BY ZIP_code
             
ORDER BY num_subs DESC
                 
magazine
             
AS most_popular_magazine
  
FROM (SELECT ZIP_code
             
magazine
             
COUNT (*) AS num_subs
          
FROM magazine_subscriptions
         
GROUP BY ZIP_code
             
magazine
        
A


Be careful, you need the DISTINCT qualifier here or you will get duplicate rows.

In Part 2 of this article, we will look at a more complicated example that illustrates how to use a weighting factor and how to create multiple result columns in a single operation.

Labels: , , , , , , ,