Finding Sister Locations to Help Each Other: Answers & Discussion

Last Updated: March 4, 2024

This week’s query exercise asked you to find two kinds of locations in the Stack Overflow database:

Locations populated with users who seem to be really helpful, meaning, they write really good answers
Locations where people seem to need the most help, meaning, they ask a lot of questions, but they do not seem to be answering those of their neighbors

In the challenge post, I gave you the same kinds of requirements that our end users would typically give, but not the kinds of requirements that a business analyst might specify. This challenge was actually inspired by a friend of mine who transitioned into an analyst role, and I was reminded of just how hard that work really is. Translating vague requests into detailed specs is a lot harder than it looks, and the wide variety of this week’s answers reflects that.

First problem: understanding the data.

The first request asked for locations with helpful users who write really good answers. There are a lot of ways to interpret that, so let’s start with a query to understand the data:

select u."location", 
	count(distinct u.id) as population,
	count(distinct p.owneruserid) as answerers, 
	count(distinct p.id) as answers,
	sum(p.score) as total_score,
	avg(p.score) as avg_score
from users u
left outer join posts p on u.id = p.owneruserid and p.posttypeid = 2 /* answers */
where u.location <> ''
group by u."location" 
order by count(distinct u.id) desc
limit 20;

select u."location",

count(distinct u.id) as population,

count(distinct p.owneruserid) as answerers,

count(distinct p.id) as answers,

sum(p.score) as total_score,

avg(p.score) as avg_score

from users u

left outer join posts p on u.id = p.owneruserid and p.posttypeid = 2 /* answers */

where u.location <> ''

group by u."location"

order by count(distinct u.id) desc

limit 20;

The results are sorted by population descending:

Some locations have more ANSWERERS. For example, row 17 (Paris, France) has 8,400 people who’ve answered questions – which is more than larger locations like rows 4, 5, 7.

Some locations have more ANSWERS. For example, row 10 (France) has 265k answers – which is more than most of the other rows in the results, even though they’ve got less people answering questions. French users seem to be more prolific.

Some locations have higher SCORES. For example, row 19 (Canada) averages 3.6 points per answer, higher than most of the top 10 locations.

So when we’re faced with finding the most helpful location, there are lots of ways we could sort the data to find it. I don’t have a problem with any particular answer – you could have used any of the above sort options and it’d probably meet some business requirements.

Finding the Most Helpful Locations

After looking at the above data, my line of thinking was, “I want to find locations that produced the most points from answers.” In the above screenshot, that’d be the total_score column, so let’s sort by that:

select u."location", 
	sum(p.score) as total_score,
	count(distinct u.id) as population,
	count(distinct p.owneruserid) as answerers, 
	count(distinct p.id) as answers,
	avg(p.score) as avg_score
from users u
left outer join posts p on u.id = p.owneruserid and p.posttypeid = 2 /* answers */
where u.location <> ''
group by u."location" 
having sum(p.score) > 1
order by sum(p.score) desc
limit 20;

select u."location",

sum(p.score) as total_score,

count(distinct u.id) as population,

count(distinct p.owneruserid) as answerers,

count(distinct p.id) as answers,

avg(p.score) as avg_score

from users u

left outer join posts p on u.id = p.owneruserid and p.posttypeid = 2 /* answers */

where u.location <> ''

group by u."location"

having sum(p.score) > 1

order by sum(p.score) desc

limit 20;

The results:

If you’re from the US, before you go strain your arm from patting yourself on the back so hard, note that other places still produce more answers, like Germany and India. Everybody’s #1 at something, depending on how you abuse your data, ha ha ho ho. I could have easily sorted by a different metric and gotten a different helpful location. Moving on!

Finding the Locations that Need Help

My request was to find places that ask a lot of questions, but do not seem to be answering questions themselves. At first glance, that sounds like, “Where there is a high ratio of questions to answers.” However, if we had a location where a person had asked 10 questions and only typed 1 answer out, then its ratio would be the same as a place where people had posted 10 million questions, and only typed 1 million answers out. Clearly, the latter place is in worse shape – they need 9 million answers!

So I’m going to sort by (questions – answers) descending, and let’s see what we get:

select u."location",
	sum(case when p.posttypeid = 1 then 1 else 0 end) - sum(case when p.posttypeid = 2 then 1 else 0 end) as the_gap,
	sum(case when p.posttypeid = 1 then 1 else 0 end) as questions,
	sum(case when p.posttypeid = 2 then 1 else 0 end) as answers,
	count(distinct u.id) as population
from users u
inner join posts p on u.id = p.owneruserid and p.posttypeid in (1,2) /* questions, answers */
where u.location <> ''
group by u."location" 
order by sum(case when p.posttypeid = 1 then 1 else 0 end) - sum(case when p.posttypeid = 2 then 1 else 0 end) desc
limit 20

select u."location",

sum(case when p.posttypeid = 1 then 1 else 0 end) - sum(case when p.posttypeid = 2 then 1 else 0 end) as the_gap,

sum(case when p.posttypeid = 1 then 1 else 0 end) as questions,

sum(case when p.posttypeid = 2 then 1 else 0 end) as answers,

count(distinct u.id) as population

from users u

inner join posts p on u.id = p.owneruserid and p.posttypeid in (1,2) /* questions, answers */

where u.location <> ''

group by u."location"

order by sum(case when p.posttypeid = 1 then 1 else 0 end) - sum(case when p.posttypeid = 2 then 1 else 0 end) desc

limit 20

The results are really interesting:

Now, first off, we have a theoretical problem. A location could be in the helpful category because they wrote a lot of high-quality answers, thereby earning a lot of points on those – but still have a high number of questions, thus end up in the second result set too! It just so happens that the problem does not surface in our results, but it could, and I feel like I’d be a bad data professional if I didn’t at least mention that.

Next up, notice that several locations in this list have a suspiciously low population! Check out row 3, “blankville” – there’s only one person who set their location to blankville, but they’ve asked an astonishing 2,625 questions and only posted 20 answers.

It’s blankman: https://stackoverflow.com/users/39677/blankman And judging by their questions, I think they’re probably purposely asking easy-to-answer questions in order to farm reputation points. That’s an interesting finding! Scrolling down through the top 20, several other locations/users appear to be after the same farming technique.

So let’s try a different sort order: instead of just sorting by sheer quantity of questions, let’s also get quality of questions involved. In our first question’s answer, we sorted by total points on their answers. Let’s try a similar technique here: let’s sort by (sum.question_score) – (sum.answer_score) to find places where they’re not only asking a lot of questions, but those questions also have high scores because they’re good.

select u."location",
	sum(case when p.posttypeid = 1 then p.score else 0 end) - sum(case when p.posttypeid = 2 then p.score else 0 end) as the_gap,
	sum(case when p.posttypeid = 1 then p.score else 0 end) as total_question_score,
	sum(case when p.posttypeid = 2 then p.score else 0 end) as total_answer_score,
	count(distinct u.id) as population
from users u
inner join posts p on u.id = p.owneruserid and p.posttypeid in (1,2) /* questions, answers */
where u.location <> ''
group by u."location" 
order by sum(case when p.posttypeid = 1 then p.score else 0 end) - sum(case when p.posttypeid = 2 then p.score else 0 end) desc
limit 20

select u."location",

sum(case when p.posttypeid = 1 then p.score else 0 end) - sum(case when p.posttypeid = 2 then p.score else 0 end) as the_gap,

sum(case when p.posttypeid = 1 then p.score else 0 end) as total_question_score,

sum(case when p.posttypeid = 2 then p.score else 0 end) as total_answer_score,

count(distinct u.id) as population

from users u

inner join posts p on u.id = p.owneruserid and p.posttypeid in (1,2) /* questions, answers */

where u.location <> ''

group by u."location"

order by sum(case when p.posttypeid = 1 then p.score else 0 end) - sum(case when p.posttypeid = 2 then p.score else 0 end) desc

limit 20

The results are different:

There are still plenty of locations with 1-2 people in the top 20, indicating that we have a lot of people farming reputation points by asking questions – but the top-voted ones are changing too.

If we want to filter out the farmers, we could try a minimum population count – but the farmers might just be hiding in regular locations, too. I’d say this is a good-enough result set to get our analysts started, and I’d let them decide how to filter out the farmers.

I hope you enjoyed this week’s challenge! The Query Exercises will be taking a break for the next couple of weeks because I’m traveling to Hong Kong and Shanghai. I’ve still got plenty of blog posts queued up, but the interactive exercises will return in April when I’m back. If you’re starving for homework, check out the prior Query Exercises posts and catch up on the ones you missed so far.

Query Exercise: Find Sister Locations to Help Each Other

Smart Postgres News and Techniques 2024 Vol 8