The Fundamentals of Select

You’re a developer who stores data in PostgreSQL. You’re comfortable writing queries, perhaps making indexes.

But you’re UNcomfortable explaining why a query runs slow. You haven’t been to a formal PostgreSQL class since college, you don’t read query plans, you’ve never read the Postgres source code (and you don’t wanna.)

You just want to quickly learn:

How queries use tables & indexes
Where to look when a query is too slow
What indexes exist on a table, and how they’re used

Let’s get started! You can either watch the video below, or keep reading – both methods cover the same concepts.

Meet the Stack Overflow users table.

In this class, like with all my training classes, I’m using the free Stack Overflow Postgres database. It’s built from StackOverflow.com‘s free public data dump with the contents of their database. I love using it for demos because it’s real-world data with real-world problems.

The users table holds exactly what you think it holds – and you can click on this to see the full size image:

There are columns for display name, website url, location, and more. Let’s say I’m looking for the people who live near me:

select * from users where location = 'Las Vegas, NV, USA';

1	select * from users where location = 'Las Vegas, NV, USA';

Run that query a few times, and you’ll notice that the order changes each time, with different users at the top of the result set each time. The above screenshot looks like it’s ordered by users.id ascending, starting with around 4,000, but run it again:

And a different set of users show up at the top. That’s because in Postgres, order isn’t guaranteed unless you put an order by on your query. Let’s fix that by finding the users with the best reputation scores in town, and just the top 100:

select * 
    from users 
    where location = 'Las Vegas, NV, USA' 
    order by reputation desc 
    limit 100;

select *

from users

where location = 'Las Vegas, NV, USA'

order by reputation desc

limit 100;

Run that a few times, and you’ll notice that it takes a second or two each time it runs. Why?

How to Find Out What a Query is Doing

When you wanna find out what work a query is doing, you can ask Postgres to explain it for you. Just put the word explain before your query:

explain 
select * 
    from users 
    where location = 'Las Vegas, NV, USA' 
    order by reputation desc 
    limit 100;

explain

select *

from users

where location = 'Las Vegas, NV, USA'

order by reputation desc

limit 100;

If you’re a Postgres guru, you might be comfortable reading those textual results. I’m a visual person, so I like plan visualizers. There are a lot of free ones out there, and at the moment, I prefer explain.dalibo.com, aka Dalibo’s PEV2.

Warning: anytime you paste anything into a web browser, you need to know whether that data is leaving your computer and going out to the internet. In the case of plan visualizers, your query and its plan may be posted publicly. If your query is private corporate material, or if it has things like client names or patient names as part of the query (think where clause parameters, for example), then you probably want to run the visualizer locally. Read the Github repo instructions on how – it’s really easy, just involves downloading a single HTML file locally.

Here’s what a visual plan looks like with Dalibo’s PEV2:

Whoa! That’s awesome, and it’s a lot more detail with helpful icons pointing out where problems are. To get that level of detail, we need to run our query with a few more explain options:

explain (analyze, buffers, costs, verbose, format json)
select * 
    from users 
    where location = 'Las Vegas, NV, USA' 
    order by reputation desc 
    limit 100;

explain (analyze, buffers, costs, verbose, format json)

select *

from users

where location = 'Las Vegas, NV, USA'

order by reputation desc

limit 100;

That’ll produce JSON that you can copy/paste out into explain.dalibo.com:

Click Submit, and you’ll get the visualized plan:

Click on the little down arrows on each rectangular operator on the right hand side to expand their details:

Reading the Explain Plan

When you’re troubleshooting a slow plan, look for the operators with warning icons:

Clock: means Postgres spent the most time on this operator (doing this thing)
$: Postgres thought this part would be the most expensive
Funnel: Postgres read a ton of rows, but filtered most of them out, indicating that we’re reading more data here than we needed to read

In our query’s case, there’s one operator that has all three icons: the Parallel Seq Scan.

The explanation on there is pretty clear: Postgres is having to scan the entire table (public.users) in order to find the people who live in Las Vegas.

To understand why, we need to look at how Postgres actually stores data on disk.

How Postgres Stores & Reads a Table

By default, Postgres stores data in 8KB pages. Think of them as like pages of a spreadsheet, or pieces of paper, and they look like this:

You can download that PDF here, and as you go through the training courses here on the site, it’ll be useful to help understand why Postgres does what it does.

For simplicity’s sake here, I’m using an actual spreadsheet page. Postgres wouldn’t leave that much empty space on a page, nor would it organize the data into columns. After all, a row doesn’t cleanly fit left to right on a page – notice how a lot of stuff is cut off here, including the about. We’ll cover more details about data types in our other Fundamentals classes. For now, just think of the data as a page like the above screenshot.

The data looks like it’s sorted by the ctid column. The ctid is actually Postgres’s internal identifier: the first number (0) indicates the page number, and the second column (like 1, 2, 3, etc) indicates the row number on the page. (Yes, gurus, it’s technically a tuple number – but again, we’re keeping this simple for this class.)

The important part to note is that the table isn’t sorted by location. If we want to find the people who live in Las Vegas, we’re gonna have to scan the entire table, checking the location of each person to see whether they live in Las Vegas.

That’s why this operator is so expensive and time-consuming:

We had to read all 22 million of our users, and we filtered out 99% of them because they don’t live in Las Vegas. If we want this to go faster, we need a copy of our data sorted by location so we can jump straight to the Las Vegas folks.

How Postgres Stores Indexes

If we want a copy of our data sorted in another way, it’s up to us to create that copy. Here’s how:

create index users_location_reputation on public.users (location, reputation);

1	create index users_location_reputation on public.users (location, reputation);

I created that index on location AND reputation because when I jump to Las Vegas, I want those users already sorted by reputation, because my query is asking for the top-ranked reputation folks. Let’s get that sorting out of the way in the index.

Creating the index takes a minute or two because Postgres is literally reading the entire users table, writing out all of the locations & reputations, sorting them, and writing the sorted copy to disk. When it’s done, Postgres has created another data structure that looks like this:

An index really is another copy of the table, but you get to choose which columns are used for the sorting, and which columns are included in the 8KB page.

One column you can’t choose, and is included regardless, is that magical hidden ctid that we mentioned earlier. When Postgres finds an interesting row in the index, like one of those people who live in Las Tunas, Cuba, then Postgres has to be able to jump over to the table itself to find the rest of the columns for that row. Like, what are those people’s names who live in Las Tunas? To do that, it reads the ctid from the index, and then jumps to that particular page number and row number on the table itself.

Now that I have the index, how does my query perform?

How Postgres Reads Data From Indexes

After the index is created, the query runs nearly instantaneously. Before, it was taking about a second, but now, look at the execution time (at the top left) on the new explain plan after we paste it into Dalibo:

The query now runs in less than a millisecond. People probably wouldn’t bring us this query calling it “slow”, but as long as we’re here, let’s look at the new explain plan for its differences.

The old plan’s slowest operator (the parallel seq scan) had 3 icons: a clock, a $, and a filter. The filter icon is gone now because we’re not reading extra rows. Postgres used the index – note the “Index Scan” operator says “using users_location_reputation.” It didn’t have to scan the entire table to find the people who live in Las Vegas – it just dive-bombed into Las Vegas and only read the rows that matched. No filtering was involved.

Postgres didn’t only use the index, however, because our query asked for all of the columns. Our index doesn’t have all the columns. That’s why the “Index Scan Node” note on the visualizer says that the index scan actually performs 2 read operations: one from the index (aka the black page of our PDF), and one from the table (the white page.)

The new plan doesn’t have a sort because the data in the index is already sorted by location, then reputation. When we jump to the people in Las Vegas, we can jump to the last one and read backwards from the highest-ranking reputation to the lowest. You can see that if you click the Misc tab of the Index Scan:

Note the “Scan Direction” = Backward. Once Postgres dive-bombed into Las Vegas, it could jump to the last row and scan backwards until it found the first 100 rows that matched.

Indexes Make Queries Go Faster. Check Yours.

Indexes are pre-sorted copies of your table to help queries run more quickly.

If you don’t have any indexes, then Postgres is going to end up scanning your whole table looking for the data that matches your query, and then sorting that data. That’s especially problematic as you pile more joins onto your query. Joins need data sorted in order to merge the data from multiple tables together.

Wanna find out what indexes exist on a table? Check out check_indexes:

select * from check_indexes('public', 'users');

1	select * from check_indexes('public', 'users');

The results:

Check_indexes is a free, open source function we provide as part of the Smart Postgres Bag of Tricks. It lists the table itself, its indexes, and their sizes.

The ordinary table is 2.8GB, and our index on location & reputation is another 188MB. The more indexes you have:

The larger your database becomes
The slower your inserts/updates/deletes go, because Postgres has to maintain all those copies of the data at the same time
The longer your backups & restores take
The longer your vacuum & analyze maintenance processes take

But, on the plus side, indexes make your queries go faster.

It’s tempting to add an index every time your query is slow, but it’s a balancing act. The first few indexes produce great results, so don’t be afraid to add a few indexes to help as many queries as possible. However, just don’t go adding indexes on every column, because you’ll end up with new problems on inserts/updates/deletes.

The Fundamentals of Select

Meet the Stack Overflow users table.

How to Find Out What a Query is Doing

Reading the Explain Plan

How Postgres Stores & Reads a Table

How Postgres Stores Indexes

How Postgres Reads Data From Indexes

Indexes Make Queries Go Faster. Check Yours.

Leave a Reply Cancel reply