Say you’ve got a reporting query that takes a long time to run. I’ll use the Stack Overflow database as an example:
|
1 2 3 4 5 6 7 8 |
select u.displayname, u.location, u.id, sum(1) as acceptedanswers from public.users u join public.posts pAns on u.id = pAns.owneruserid join public.posts pQ on pAns.Id = pQ.acceptedanswerid group by u.displayname, u."location", u.id order by sum(1) desc limit 100; |
That query gets the top 100 users who have the most accepted answers. On my server, that takes about a minute to run – unacceptably long for a public-facing web site, for example.
Does that data need to be up to the second? Of course not. The leaderboard of the top 100 answerers isn’t going to change all that quickly. People who have successfully answered hundreds of questions aren’t going to suddenly disappear, nor is someone else suddenly going to rocket to the top.
For report queries like this, Postgres offers materialized views: a view that’s written to disk, and then updated only when you want to update it. Here’s how to create one:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
drop view if exists mv_top_accepted_answers; CREATE MATERIALIZED VIEW mv_top_accepted_answers AS SELECT u.displayname, u.location, u.id, COUNT(*) AS acceptedanswers FROM public.users u JOIN public.posts pAns ON u.id = pAns.owneruserid JOIN public.posts pQ ON pAns.id = pQ.acceptedanswerid GROUP BY u.displayname, u.location, u.id order by COUNT(*) desc limit 100 WITH DATA; CREATE unique INDEX idx_mv_top_accepted_answers_acceptedanswers_id ON mv_top_accepted_answers (acceptedanswers desc, id); |
(The unique index isn’t strictly required, but more on that later.)
We have to change our query to point to the new materialized view. (Just as a side note, this is one of the cool things you don’t have to do with Microsoft SQL Server Enterprise Edition. It automatically takes the original query, figures out that there’s an indexed view it can use, and automatically rewrites the query plan to use the materialized view.)
|
1 2 3 4 5 6 7 8 9 10 |
SELECT displayname, location, id, acceptedanswers FROM mv_top_accepted_answers ORDER BY acceptedanswers DESC LIMIT 100; |
The query runs instantaneously, and the resulting explain plan shows that the query only hits the materialized view:
We can then update that materialized view on a scheduled basis, like overnight when workloads are lighter. If we have a unique index on the view, we can even update the materialized view’s contents while other people are running queries:
|
1 |
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_top_accepted_answers; |
This is important because the more complex & resource-intensive our materialized views are, the more performance impact we’ll have on the server while we refresh the view. In our case, the query at the top of the post only took about 1 minute to run, but building or refreshing the materialized view – even with just 100 rows – took about 5 minutes to run. Even if no data has changed, the rebuild still takes time because Postgres doesn’t track which rows in the underlying tables changed. It rebuilds the view from scratch.
This particular indexed view is really small because it only has 100 rows. We can see the size with check_indexes:
It’s just 8KB for the materialized view and 16KB for the b-tree index. Of course, if we didn’t have the LIMIT 100 on building the view, all tens of millions of users would be in the view, and it would take longer to build and refresh.
It’s up to us as to when we want to refresh our materialized view – daily, weekly, hourly, etc – and it’s up to us to build the mechanism to refresh that view. We could do it via a cron job, or as part of a nightly set of commands called by our own code.
I wish I could get check_indexes to alert us when a materialized view is out of date, either by checking the number of modified rows in the underlying tables, or their quantities, or the date since the last refresh. Unfortunately, I’m not aware of a reliable way to do that, that doesn’t involve adding extensions to Postgres. If you’ve got ideas, I’d love to hear them in the comments.



4 Comments. Leave new
I’d love to learn of a technique to combine materialised views with the latest data to get a combination of performance and newest data sets. I know timescale is able to do this via a mix of partitioning and using dates to update an older result set, but the details are fuzzy to me.
Another thing that really bothers me about materialised views is that you can’t selectively update it (for example, if a user logs in, I only meed to update the last_login field on that specific user). You can always create a completely separate table, but then it quickly becomes an unmaintainable code spaghetti with all the triggers.
About the first question – from a high level, you would need to combine 3 things:
Then have your query hit that outer view, and it won’t have to be modified each time you update views #1 and #2 with new dates.
…sometimes, the answers to things are right in front of you, staring you in the face
Makes sense, cheers!
[…] Brent Ozar builds a view but a special one: […]