For this week’s query challenge, we’re going to do some data validation. Pop open the Stack Overflow database and check out the commentcount column on the posts table:
In theory, that’s a fast way to check how many comments a particular post has, without querying the comments table to find them all. For example, if I want to see the comments for post.id 13204792, I might write:
|
1 |
select * from comments where postid = 13204792; |
Which does indeed produce 10 rows, matching the commentcount = 10 value in the above screenshot:
But… I’m a DBA, and you know what DBA stands for: Doesn’t Believe Anyone. So our query exercise for this week is to:
- Write a query that finds the top 100 problematic posts with the biggest commentcount variance versus the actual number of comments on that post. This is a little tricky for two reasons: the posts.commentcount could be off in either direction, AND these tables are ginormous, so your queries are going to take a while. You’ll want to put in some thought before you just go hit execute.
- Think about a long term solution to keep that posts.commentcount as up to date as practical. We’re not worried about a one-time update – we’re more concerned with making sure the data is relatively accurate. It doesn’t have to be transactionally consistent here – these aren’t bank balances, after all.
Have fun, and when you’ve given it a shot, check out our discussion of solutions.


