Your Query Exercise was to find denormalization accuracy problems by checking the accuracy of a reporting column, posts.commentcount. There were two parts: finding the top 100 posts with the biggest variances, and thinking about a long-term solution to keep commentcount as accurate as practical.
Question 1: Finding the Problematic Posts
Your first attempt might have looked something like this:
```sql
select p.id,
       count(distinct c.id) as comments_actual,
       p.commentcount,
       count(distinct c.id) - p.commentcount as variance
from posts p
inner join comments c on p.id = c.postid
group by p.id, p.commentcount
having count(distinct c.id) <> p.commentcount
order by count(distinct c.id) - p.commentcount desc
limit 100;
```
Which does indeed show that we’ve got some problems:
But that query has two problems. First, because we’ve got an inner join to comments, we’re going to miss situations where posts.commentcount reports that we have comments, but there aren’t any rows in the comments table at all. We’re going to need a left outer join.
Second, there’s a problem with this line:
```sql
order by count(distinct c.id) - p.commentcount desc
```
It focuses on variances where we have more comments than p.commentcount reflects – but not the other way around. What about cases where p.commentcount is overstating things?
Your next approach might have been two different queries or CTEs, unioned together, with one sorting for higher p.commentcounts, and the other sorting for higher actual comments. However, given the size of these tables, that’d be a performance nightmare, making multiple passes through the same tables. (You were probably already frustrated with how long results took on this one!)
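To see why that hurts, here’s a rough sketch of what the two-pass approach might look like – note that each half scans posts and comments all over again:

```sql
-- One pass for posts where the actual count is higher than commentcount...
select p.id, count(distinct c.id) as comments_actual, p.commentcount
from posts p
left outer join comments c on p.id = c.postid
group by p.id, p.commentcount
having count(distinct c.id) > p.commentcount

union all

-- ...and a second, separate pass for posts where commentcount is higher,
-- meaning two full trips through the same big tables.
select p.id, count(distinct c.id) as comments_actual, p.commentcount
from posts p
left outer join comments c on p.id = c.postid
group by p.id, p.commentcount
having count(distinct c.id) < p.commentcount;
```

That’s a sketch, not a recommendation – the point is just that the work roughly doubles.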
A slick way to do it is to sort by the absolute value (abs) of the variance, like this:
```sql
select p.id,
       count(distinct c.id) as comments_actual,
       p.commentcount,
       abs(coalesce(count(distinct c.id), 0) - p.commentcount) as variance
from posts p
left outer join comments c on p.id = c.postid
group by p.id, p.commentcount
having coalesce(count(distinct c.id), 0) <> p.commentcount
order by abs(coalesce(count(distinct c.id), 0) - p.commentcount) desc
limit 100;
```
I’ve also switched the inner join to a left outer join, and I’ve added a coalesce around the count of comments as a safety net for posts with no matching comments. (Strictly speaking, count returns 0 rather than null in a grouped query, even across a left join, but the coalesce makes the intent explicit.)
It turns out that our biggest variances are cases where p.commentcount is overestimating the number of comments:
But if you scroll down through the result set, there are indeed variances in both directions, even in just the top 100 rows:
Okay, so part 1 is done: we definitely have a problem. What should we do long term?
Question 2: Solving It Long Term
Your homework assignment specifically skipped writing a one-time update statement. I asked you to focus on how we should keep that posts.commentcount column as up to date as practical, reminding you that it wasn’t a transactional system like bank balances.
When we wanna keep data up to date between two tables, we have 3 basic options:
- Synchronously, done by the database server itself – which would make insert transactions longer when people add new comments, plus add workload to the database server to coordinate that transaction. This would typically be accomplished with a trigger on the Comments table so that as soon as rows were deleted, updated, or inserted, the Posts.CommentCount could be updated at the same time, and always in sync. For writing triggers like this, hand the work off to ChatGPT, like we talked about in this week’s news and techniques.
- Asynchronously, done by the database server – add a trigger to the comments table, but rather than updating Posts.CommentCount right away, simply add a row to a queue table indicating that the post.id needs its comment count updated. Later, circle back with an async job to run the commentcount updates only for the posts listed in the queue table.
- Asynchronously, done by the app – instead of a trigger, have the front end app (or service layer) add a record to something whenever it inserts a comment. That something doesn’t even have to be a relational database – it could be a queue service or a key/value store. Then, have a service check that queue later and run update queries only for the specific posts listed in the queue.
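To make the trigger-plus-queue option concrete, here’s a rough sketch in Postgres-flavored SQL. The queue table and function names are made up for illustration – adapt the syntax to your own platform:

```sql
-- A queue table holding posts whose counts are known to be stale.
create table commentcount_queue (
  postid bigint primary key
);

-- Trigger function: instead of updating posts right away,
-- just note that this post's count needs a refresh.
create function queue_commentcount_update() returns trigger as $$
begin
  insert into commentcount_queue (postid)
  values (coalesce(new.postid, old.postid))  -- old covers deletes
  on conflict (postid) do nothing;           -- already queued? no extra work
  return null;
end;
$$ language plpgsql;

create trigger comments_queue_commentcount
after insert or update or delete on comments
for each row execute function queue_commentcount_update();
```

A separate async job then drains commentcount_queue every few minutes, recomputing counts only for the posts listed there – no matter how many comments each post picked up in the meantime.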
The advantage of the async approach is that we only have to update a posts row once every X minutes, no matter how many comments have been added to it. If comments come in fast & furious in the first, say, hour of a post’s existence, then we can reduce that workload with a queue.
In fact, if you check out the above screenshot with variances, you’ll notice that the largest variances all cluster in a similar, high range of ids – some of the more recently inserted posts, basically. That would seem to suggest an asynchronous process.
Your long term solution might have been just that! You might have recommended a periodic job, say hourly, to update the posts.commentcount column. Depending on a few things, that might be fine:
- If we’re willing to let the data be out of sync on newly created posts
- If we don’t have too many posts columns to update (you saw how slow these queries were – imagine how bad it’d be if we had a dozen similar reporting columns to keep in sync)
- If the regular process doesn’t cause blocking problems when it runs for a long time
- If we have enough database server horsepower to get ‘er done without too much impact to end users
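If you went the queue-free route, the periodic job itself might be a pair of statements like this sketch (Postgres-style update-from syntax; adjust for your platform, and consider batching by id range on big tables to limit blocking):

```sql
-- Recompute commentcount for every post whose stored value is stale.
update posts p
set commentcount = c.actual
from (
  select postid, count(*) as actual
  from comments
  group by postid
) c
where c.postid = p.id
  and p.commentcount <> c.actual;

-- Don't forget posts with a nonzero commentcount but no comments at all --
-- the subquery above never produces a row for them, the same trap
-- the inner join fell into earlier.
update posts p
set commentcount = 0
where p.commentcount <> 0
  and not exists (select 1 from comments c where c.postid = p.id);
```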
On the other hand, you could also choose the synchronous approach: every time we insert a new comment row, update the posts.commentcount row for the post we’re commenting on. This could either be done with a trigger on comments, or with an added update statement on whatever app code is inserting the comments. A big advantage of the sync approach is that the reporting data is always up to date. Business users like that.
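In app code, the synchronous version is just two statements wrapped in one transaction, roughly like this (the column names and values here are illustrative, not the real schema):

```sql
begin;

-- Insert the new comment (hypothetical columns and values):
insert into comments (postid, userid, body)
values (12345, 678, 'Great post!');

-- Keep the denormalized count in sync in the same transaction:
update posts
set commentcount = commentcount + 1
where id = 12345;

commit;
```

Deletes would decrement instead, and if either statement fails, the rollback keeps the two tables consistent.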
Generally, I’m a fan of keeping the database simple for scalability purposes: start with the synchronous approach, done in app code. When you hit a performance wall where that stops scaling, and if this particular process is the bottleneck holding you back from going faster, only then do something more complicated like an async batch job, after making sure the business users are okay with the data being out of sync.
How’d you do? What’d you think?
I hope you had fun, enjoyed this exercise, and learned a little something. If you’ve got comments or questions, hit the comment section below.



