Query Exercise: Find Posts with the Wrong CommentCount

For this week’s query challenge, we’re going to do some data validation. Pop open the Stack Overflow database and check out the commentcount column on the posts table:

In theory, that’s a fast way to check how many comments a particular post has, without querying the comments table to find them all. For example, if I want to see the comments for post.id 13204792, I might write:

select * from comments where postid = 13204792;

1	select * from comments where postid = 13204792;

Which does indeed produce 10 rows, matching the commentcount = 10 value in the above screenshot:

But… I’m a DBA, and you know what DBA stands for: Doesn’t Believe Anyone. So our query exercise for this week is to:

Write a query that finds the top 100 problematic posts with the biggest commentcount variance versus the actual number of comments on that post. This is a little tricky for two reasons: the posts.commentcount could be off in either direction, AND these tables are ginormous, so your queries are going to take a while. You’ll want to put in some thought before you just go hit execute.
Think about a long term solution to keep that posts.commentcount as up to date as practical. We’re not worried about a one-time update – we’re more concerned with making sure the data is relatively accurate. It doesn’t have to be transactionally consistent here – these aren’t bank balances, after all.

Have fun, and when you’ve given it a shot, check out our discussion of solutions.

Query Exercise: Find Posts with the Wrong CommentCount

Leave a Reply Cancel reply