For this week’s Query Exercise, let’s start by taking a look at the Users table to see what we’re dealing with:
|
1 2 3 4 5 |
select u.location, count(*) as population, avg(u.reputation) as AvgReputation from users u group by u.location order by count(*) desc limit 100; |
The results:

That query has a few problems, but hold that thought for a moment. (You’re going to have to solve those problems, but I just wanted to show you the sample data at first to give you a rough idea of what we’re dealing with.)
Our business users are thinking about doing some targeted advertising, and they have two questions.
First, what are the top 10 locations populated with users who seem to be really helpful, meaning, they write really good answers?
Second, what are the top 10 locations where people seem to need the most help? Meaning, they ask a lot of questions, but they do not seem to be answering those of their neighbors?
Your first guess would probably be to adapt the query above and simply sort in different directions, but buckle up. The data presents a lot of challenges that we need to consider.
- There are null and empty locations. Those should probably be excluded.
- In the Users table, the Location column’s datatype is varchar(100). People can type in whatever they want, which leads to a lot of oddball one-off locations. We should filter for a minimum number of people involved, especially since we’re considering paid advertising. Let’s filter for only locations with at least 100 people in them.
- For the sake of this exercise, let’s pretend that each Location value is geographically unique. You’ll notice in the screenshot that there are entries for “India” and for “Bangalore”. You and I know that everyone who lives in Bangalore also lives in India, but for the sake of this question, let’s pretend that’s not the case. In reality, obviously we need to clean up our Location data, but I’ll save that for another challenge.
- High values for Reputation do not necessarily mean that people write really good answers. You can earn Reputation points a lot of ways, including writing good questions.
- You don’t have to join the first and second queries together. Just write one query for the top 10 helpful locations, and another query for the top 10 locations that need the most help. They can be separate queries, and I’m not worried about performance for the sake of this exercise.
Part of your solution is going to involve looking at the Posts table, which stores questions and answers (and other things.) A few columns to consider:
- Posts.PostTypeId = when 1, the Post is a Question. When 2, it’s an Answer.
- Posts.OwnerUserId = the Users.Id who owns the question or answer.
- Posts.Score = the quality of the question or answer.
For this exercise, I’m not expecting a specific “right” or “wrong” answer – instead, for this one, you’re probably going to have a good time sharing your answer in the comments, and comparing your answer to that of others. Feel free to put your queries in a Github Gist, and include that link in your comments. I’ll circle back in a week and write about different ways to solve it. Have fun!
