2009/01/26

The "Popularity Problem": An Application of SQL Analytic Ranking Functions (Part 1)

This is the first in a series of occasional articles on SQL features and coding techniques.

A situation that comes up frequently in my work is the selection of a "primary" or "dominant" attribute for an individual or group. Some examples are:
  • Find the most popular magazine subscription in a ZIP code
  • Identify a store where a customer spends the most money
  • Who won the election?
In these examples, there is a one-to-many relationship that we want to convert into a one-to-one relationship. We start with a data set having a grain of one-to-many, and we want to reduce this by selecting the most popular choice. In this article, I'll discuss some variations on the problem, how to do it in SQL the "hard way", and how to use SQL Ranking functions to get to the solution easily.

A Simple Case - Who won the Election?

Let's start with the simplest variant of the problem, the election. Here, there is a population of people, and a set of candidates. We assume there is one vote per person, and everyone's vote counts equally. A sample data set might look like this:

Voter Candidate
John Q. Public John McCain
Ennio Citizen Barack Obama
Mary J. Student Barack Obama
Alfred E. Oldman Ralph Nader
Carter Perch Bob Barr
Roger Clifton Barack Obama
Eddie Thomas John McCain

To find the winner, we have to count how many votes each candidate received, then sort on the total number of votes (highest to lowest). The person at the top of the list is the winner. In fact, if we don't care who came in second, third, or fourth, then we don't need to sort; we can just choose the candidate having the highest number of votes. This is a simple and obvious example, but it illustrates some concepts that we will need later.

In any version of SQL, we can write a one-statement query to list the candidates in order:
SELECT candidate
  
FROM votes
  
GROUP BY candidate
  
ORDER BY COUNT(*) DESC


You might already see a problem. Who came in third? The SQL query we just ran will give a non-deterministic answer. Since we have no way to handle a tie, the query could put either Mr. Nader or Mr. Barr in third place, and there is no way to tell in advance what the result will be. In fact, you could run the exact same query twice, and get two different results.

So we need a tiebreak rule. In every problem of this type, a tiebreak rule is required in order to get consistent, reproducible results. Also, the tiebreak rule must be sufficient to completely disambiguate the candidates. In SQL terms, the tiebreak rule must be created using a set of columns that uniquely identify each row. This is also known as a candidate key.

In our simple election example, we can get by if we add a secondary sort, using the candidate's name. It is arbitrary whether to sort alphabetically ascending or descending (or using some other method, like the reverse of the string).
The sorting method will of course affect the result, and care should be taken to choose a fair (or at least reasonable) method if possible. But a choice must be made.

Here is our finished query to give the election results:
SELECT candidate
  
FROM votes
 
GROUP BY candidate
 
ORDER BY COUNT(*) DESC
     
candidate DESC



How would we get back a single row, showing only the winner? The easiest way would be to use the "TOP 1" option with the above query. Another way would be to assign a sequential ranking number to each candidate, and then select the one whose rank=1. However if your version of SQL does not have the functionality to do either of those, but does allow subqueries, you could still get a solution using "brute force":
SELECT A.candidate
  
FROM (SELECT candidate
             
COUNT (*) AS num_votes
          
FROM votes
         
GROUP BY candidate
        
A
     
, (SELECT num_votes
             
MAX (candidateAS last_candidate_name
          
FROM (SELECT candidate
                     
COUNT (*) AS num_votes
                  
FROM votes
                 
GROUP BY candidate
                
BB
         
GROUP BY num_votes
        
B
     
, (SELECT MAX(num_votesAS max_votes
          
FROM (SELECT candidate
                     
COUNT (*) AS num_votes
                  
FROM votes
                 
GROUP BY candidate
                
CC
        
C
  
WHERE A.candidate B.last_candidate_name
    
AND A.num_votes B.num_votes
    
AND A.num_votes C.max_votes




Although we got the answer using a single query, this seems like quite a lot of work! And we are scanning the "votes" table three times. It would make more sense to store some results in an intermediate table. This would also be a good place to use "Common Table Expressions", a feature introduced with Microsoft SQL Server 2005.

Magazine Subscription Problem

The preceding example has a single group. But what if we have many groups, and we want to find the most popular selection within each group? The magazine subscription problem is a good illustration of this situation:

Subscriber ZIP code Magazine
John Q. Public 20001 Life
John Q. Public 20001 Washingtonian
Ennio Citizen 20001 New Yorker
Ennio Citizen 20001 Washingtonian
Mary J. Student 90210 Atlantic
Mary J. Student 90210 Chronicle
Mary J. Student 90210 Life
Alfred E. Oldman10016New Yorker
Alfred E. Oldman10016Time
Carter Perch90210New Yorker
Carter Perch90210Time
Carter Perch90210Mad
Carter Perch90210Atlantic
Roger Clifton20001Vogue
Roger Clifton20001Elle
Eddie Thomas10016New Yorker
Eddie Thomas10016Atlantic
Eddie Thomas10016Time

We want to identify the single most popular magazine within each ZIP code. As before, we need a tiebreak rule, which will just be the magazine name in (ascending) alphabetical order. Here is the "brute force" method:
SELECT A.ZIP_code
     
A.magazine
  
FROM (SELECT ZIP_code
             
magazine
             
COUNT (*) AS num_subs
          
FROM magazine_subscriptions
         
GROUP BY ZIP_code
             
magazine
        
A
     
, (SELECT ZIP_code
             
num_subs
             
MIN (magazineAS first_magazine_name
          
FROM (SELECT ZIP_code
                     
magazine
                     
COUNT (*) AS num_subs
                  
FROM magazine_subscriptions
                 
GROUP BY ZIP_code
                     
magazine
                
BB
         
GROUP BY ZIP_code
             
num_subs
        
B
     
, (SELECT ZIP_code
             
MAX(num_subsAS max_subs
          
FROM (SELECT ZIP_code
                     
magazine
                     
COUNT (*) AS num_subs
                  
FROM magazine_subscriptions
                 
GROUP BY ZIP_code
                     
magazine
                
CC
         
GROUP BY ZIP_code
        
C
  
WHERE A.ZIP_code B.ZIP_code
    
AND A.magazine B.first_magazine_name
    
AND A.num_subs B.num_subs
    
AND A.ZIP_code C.ZIP_code
    
AND A.num_subs C.max_subs


Again, an awful lot of work and duplicated code.

Here is an alternative method, using SQL Server 2000 and the "IGNORE_DUP_KEY" indexing option:

CREATE TABLE most_popular_magazine_by_ZIP_code
ZIP_code VARCHAR (10)
magazine VARCHAR (30)
)

CREATE UNIQUE INDEX most_popular_magazine_by_ZIP_code_IDX
ON most_popular_magazine_by_ZIP_code
ZIP_code
)
WITH IGNORE_DUP_KEY

INSERT INTO most_popular_magazine_by_ZIP_code
  
SELECT ZIP_code
       
magazine
    
FROM magazine_subscriptions
   
GROUP BY ZIP_code
       
magazine
  
ORDER BY ZIP_code
      
COUNT (*) DESC
      
magazine

SELECT FROM most_popular_magazine_by_ZIP_code


This requires an extra table, a couple of steps, and a Microsoft-specific feature, but for a large data set it will perform better than the "brute force" method. Notice the ORDER BY clause for the INSERT statement - it includes the magazine name as the final sort key, to enforce the tiebreak rule. This is very important because otherwise you will have a non-deterministic result.

Let's see how using Analytic Functions might help.

Analytic functions are also known as Ranking or Windowing functions. In Microsoft SQL Server 2005, they include the functions RANK(), DENSE_RANK(), NTILE(), and ROW_NUMBER(). Oracle has had analytical functions since Version 8i (1999). In addition to the four functions just listed, Oracle includes a much richer set of functions, including CUME_DIST(), FIRST_VALUE(), LAST_VALUE(), and PERCENT_RANK(). Unlike SQL Server, Oracle functions can also operate on a sliding window of rows, so that you can do things like a moving average within a single query.

For the magazine subscription problem, we can use the ROW_NUMBER() function. This function will return a unique, sequential, ascending number for each row in a result set. But it can also operate on groups of rows that are defined by a partition, so that each group has its own set of sequence numbers.

The method is to count the number of subscriptions, grouping by both ZIP_code and magazine name, and then use ROW_NUMBER() to assign a popularity ranking for each magazine
within each ZIP code:
SELECT ZIP_code
     
magazine
     
num_subs
     
row_number()
       
OVER (PARTITION BY zip_code
       
ORDER BY num_subs DESC
           
magazine
             
)  AS mag_rank
  
FROM (SELECT ZIP_code
             
magazine
             
COUNT (*) AS num_subs
          
FROM magazine_subscriptions
         
GROUP BY ZIP_code
             
magazine
        
A


Again, notice the use of the magazine name to enforce the tiebreak rule - it is included in the ORDER BY clause within the PARTITION specification. Here is the result:

ZIP_code Magazine num_subs mag_rank
10016 New Yorker 2 1
10016 Time 2 2
10016 Atlantic 1 3
20001 Washingtonian 2 1
20001 Elle 1 2
20001 Life 1 3
20001 New Yorker 1 4
20001 Vogue 1 5
90210 Atlantic 2 1
90210 Chronicle 1 2
90210 Life 1 3
90210 Mad 1 4
90210 New Yorker 1 5
90210 Time 1 6

Now, to get the single most popular magazine for each ZIP, all we have to do is filter on mag_rank = 1:
SELECT ZIP_code
     
magazine AS most_popular_magazine
  
FROM (SELECT ZIP_code
             
magazine
             
num_subs
             
row_number()
               
OVER (PARTITION BY zip_code
               
ORDER BY num_subs DESC
                   
magazine
                     
)  AS mag_rank
          
FROM (SELECT ZIP_code
                     
magazine
                     
COUNT (*) AS num_subs
                  
FROM magazine_subscriptions
                 
GROUP BY ZIP_code
                     
magazine
                
A
        
B
  
WHERE mag_rank 1



This is even easier in Oracle, where the FIRST_VALUE() function does the filtering for you:
SELECT DISTINCT ZIP_code
     
first_value (magazine)
       
OVER (PARTITION BY ZIP_code
             
ORDER BY num_subs DESC
                 
magazine
             
AS most_popular_magazine
  
FROM (SELECT ZIP_code
             
magazine
             
COUNT (*) AS num_subs
          
FROM magazine_subscriptions
         
GROUP BY ZIP_code
             
magazine
        
A


Be careful, you need the DISTINCT qualifier here or you will get duplicate rows.

In Part 2 of this article, we will look at a more complicated example that illustrates how to use a weighting factor and how to create multiple result columns in a single operation.

Labels: , , , , , , ,

2009/01/11

Six Foot Column of Meat Pies

Have you ever heard something in passing, and thought to yourself, "Why, that sounds interesting, but I don't know much about it - I'll look it up later on Google"? Well it happens to me all the time. Natural curiosity I guess, and with Google you can pretty much find anything - that is, anything that actually exists. In fact I located a long-lost high-school sweetheart, and eventually married her, through Google ... now she calls me "Mr. Google" ... but that is another story.

What does it mean when you go to Google, search on your keywords, and turn up absolutely nothing? I mean, ZERO hits? Doesn't even make the list? Doesn't even show up on Google Trends?

Shortly before Thanksgiving last year, I was driving my stepson to school. While en-route, he likes to punch the radio buttons like crazy, veering wildly from station to station. In between Led Zeppelin on one station, and Black Sabbath on another, this phrase popped out: "Six foot column of meat pies". I tried to grab the button and turn back to that station, because wow, how could I pass up the opportunity to see something like that? Hey, I've seen James Brown, the Grand Canyon, and the Eiffel Tower, but I've never seen a column of meat pies, much less one that is taller than me!

But we were too late - we didn't know what station it came from, and the commercial or whatever it was must have finished, before we could scan enough stations to find it again.

Now where would you find a column of meat pies? A bakery? A restaurant? Surely it is some local business with a Thanksgiving promotion, I thought. And if they are advertising on the radio, it would certainly be on the Internet. I just hoped I wouldn't find thousands of meat pie dealers, and be unable to tell which one put out the commercial on our local station!

So, as you must have guessed, when we got home I Googled it. Nothing. OK, so there must be a couple of days of lag time - so I tried a few days later. Still nothing. "Six foot column of meat pies, six foot column of meat pies, six foot column of meat pies ..." say it over and over, and you get into kind of a chanting, conga-line rhythm, that you might find stuck in your head. Every week or so I would check Google again. Every month. Nothing. Nothing. Nothing. How could this be?

My stepson was forming a rock band, and they needed a name. "How about 'Six Foot Column of Meat Pies'?" I suggested.

Then one day I tried using the numeral 6, instead of the word "Six". Don't know why this never occured to me before - and besides, doesn't Google handle this kind of substitution automatically? I know it checks alternate spellings and puts in singular when you mean plural ....

Well finally the mystery is solved. No, there is no local bakery or restaurant specializing in meat pies. Nobody in this country even eats meat pies (well, almost nobody). It was a news story, originating in England (who could have guessed that?), with the headline "My six years of agony over a meat pie". And what's worse, it turns out the "column of meat pies" wasn't really six feet high at all - it was 1.8 meters! Try doing a conga-line to that.

As disappointed as I am, since I now realize I will likely never actually see a six foot column of meat pies, I can at least console myself with the thought that Google will eventually pick up this article. Then there will at last be something to find, if there is any other poor soul out there still looking!

---

On 2009/01/12 at 11:35 AM, the author added this Followup:

Within 15 hours, Google came through. Searching on "six foot column of meat pies" (including the quotes) now points to this page.

---

On 2009/01/15 at 1:46 AM, the author added this Followup:

How fleeting is fame. Three days later, and my same Google search turns up empty again.

---

On 2009/01/27 at 7:42 AM, the author added this Followup:

Well, whaddya know. My Google search hit is back. And I turned up on Technorati.

Labels: , ,

2008/07/10

Thanks In Advance for Reading My Blog

Several times a day I get requests that end with some variation of "Thanks in advance". I understand that the correspondent is trying to be polite, but I can't help feeling a bit uncomfortable with this expression. I remember reading a Miss Manners column on this topic years ago - sadly I can't find it now - but the idea has stuck with me that it is bad form to thank someone "in advance".

In search of some justification for this position, I turned to Google and discovered that there is some debate.

Western Michigan University says:
... avoid thanking the reader in advance ... Thanking in advance is discourteous because it presumes that the reader has no choice other than to respond.

The Minneapolis Star Tribune says:
Don't ... Use the phrase “thank you in advance” in your complimentary close. It’s standard practice to encourage readers to take a desired action by thanking them before they’ve actually done it, but “in advance” can sound presumptuous.

WriteExpress says:
Avoid thanking the person beforehand ... To do so is presumptuous and suggests you do not feel the need to write a follow-up letter.

Apparently in Austrailia it is acceptable in some situations
Is it polite or presumptuous when you thank people in advance?

... but not in New Zealand:
Is the phrase "Thanks in advance" offensive?

The 1913 book Composition Planning by John Baker Opdycke says:
"Thanking in advance" smacks something too much of the spirit of forcing ... the reader ... into granting our request
But not everyone agrees, as you can see from the following:
Finally, for a light-hearted view, there is this from The Cynic:
Thank You in Advance

After reviewing these references, my opinion hasn't changed. One should avoid using the phrase "thanks in advance", if only because it could offend some of your audience. On the other hand, one should try not to be offended if someone says it to you, because they probably only intend it as a courtesy.

And finally it seems like "thanks in advance" has turned into one of those ubiquitous expressions that it won't be possible to avoid. At least it's not outright wrong, like using "loose" when you mean "lose", or "tact" instead of "tack"!


Labels: , , , , ,