Query Exercise: Find Tagged Questions Faster.

2 Comments

For this week’s Query Exercise, we’re going to start with an existing query that has performance problems from time to time, depending on how it’s used.

This query’s goal is to find the top-voted Stack Overflow questions for any given tag – for example, here are the top-voted questions tagged “postgresql”.

What are tags? I’m glad you asked. When you’re looking at a question at Stack Overflow, you’ll see a list of tags for that question:

The tags are at the bottom of that screenshot: database, postgresql, command, database-table, dbtable. A question can have up to 5 tags. To see how they’re stored in the Stack Overflow database, let’s look at the posts table, which is where questions & answers are stored:

If we scroll across to the right, we see a tags column:

So to find the questions tagged postgresql, our analyst wrote this query:

And our DBA helped by creating this index:

Right now, the execution plan is fairly simple. Postgres scans that above index backwards, going from the top-scoring posts down, checking each one’s tags to see if it contains the term “<postgresql>”.

Your exercises for this week are:

  1. What kinds of tags will perform worse than others for this query?
  2. Could you change the query to perform better?
  3. Could you change the indexes to perform better, without changing the table structure?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. You can also link to code in Github Gists.

We’ll revisit your answers in this post. Have fun!

Previous Post
Find the Best Time for Maintenance: Answers & Discussion
Next Post
SQLAI.ai Review: Using AI to Generate & Tune Queries

2 Comments. Leave new

  • I like your blog. These exercise questions are a cool idea.

    So, for this week’s questions:

    1. What kinds of tags will perform worse than others for this query?
    Any tag that has a low amount of highly-upvoted questions (to be precise, that has less than 100 highly-upvoted questions) will make the query perform badly. This is because all questions need to be scanned, no matter whether they refer to the tag or not, until 100 relevant questions have been found. For example, on your test database, finding the 100 top questions tagged “postgresql” takes about 1 second, but finding the 100 top questions tagged “ocaml” takes about 12 seconds.

    2. Could you change the query to perform better?
    I can’t think of any way to do so. There’s no more efficient way to scan the table, as the index is essentially useless for finding questions with a particular tag.

    3. Could you change the indexes to perform better, without changing the table structure?
    The desire here is to swap the two indexed columns, i.e. index ON posts(tags, score) rather than ON posts(score, tags). Ideally, this would allow us to first find all questions with a particular tag, and then have them sorted by score as well. Unfortunately, this is complicated by the fact that the tags are all concatenated into a single column, such that a simple BTree column index on the tags column doesn’t actually help in any way — we still can’t find questions with a particular tag, we just have an alphabetical ordering by the first tag. This is useful for finding questions with a particular first tag, but the tag we’re looking for doesn’t have to be the first tag of any particular question.
    What we would like to have is a GIN index on the tags, which would allow us to check whether a particular tag is an element in the tags array, only requiring a bitmap scan on the index. Now, the tags aren’t an array, but we can fix that using an expression index. Something like this:

    create index posts_tags
    on posts
    using gin (regexp_split_to_array(trim('><'));
    

    This will first split the tags into an array, then index on the tags. We can now use the same expression to find all questions with a particular tag:

    select *
    from posts
    where (regexp_split_to_array(trim('> '{java}'
    order by score desc
    limit 100;
    

    In some cases, the amount of posts for a particular tag will be very high. It might therefore be reasonable to also add a BTree index on the score column — the posts_score_tags index in the article already provides this. This may help Postgres avoid an explicit sort.

    I have not tested this, performance-wise, as I don’t have a local copy of the Stackoverflow dump and the demo server doesn’t allow users to create indices.

    Reply

Leave a Reply to Brent Ozar Cancel reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.
You need to agree with the terms to proceed