I’ve worked with Microsoft SQL Server since the late 1990s. In that world, almost every table has what’s called a clustered index, which we think of as the data itself. The table’s data is stored in 8KB pages, and by creating a clustered index, you define how the table itself is sorted.
For example, let’s take the users table from the Stack Overflow database. In SQL Server, the table’s creation statement looks like this:
CREATE TABLE Users(
    Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    DisplayName VARCHAR(40),
    Location VARCHAR(100),
    Reputation INT);
The resulting data structure is a b-tree set of 8KB pages where the Users are sorted by Id. This has a few interesting advantages:
- If you know a particular user’s Id, you can dive directly into that row, reading as few pages as possible. It’s like having an index on Id – but without the extra overhead of another index, or the overhead of reading extra pages to get columns beyond Id.
- When you insert a row, it’ll have the highest Id, so you can dive into the last page and add it there.
- Because inserts are always focused on the tail end of the table, it’s very likely that the last page is already cached in memory, making inserts fast. (There’s a quick sketch of both behaviors after this list.)
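Here’s a minimal sketch of those two behaviors against the table above – the Id and the values are arbitrary examples, not real Stack Overflow data:

SELECT Id, DisplayName, Location, Reputation
FROM Users
WHERE Id = 12345;
-- Seeks straight down the clustered index b-tree to the row; no separate
-- key lookup needed, because the row itself is the leaf of the clustered index.

INSERT INTO Users (DisplayName, Location, Reputation)
VALUES ('New User', 'Parts Unknown', 1);
-- IDENTITY hands out the next-highest Id, so the new row lands on the
-- last page of the clustered index.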
However, this also comes with some disadvantages:
- If for some reason you need to insert a row somewhere other than at the highest Id, there probably won’t be enough space on the earlier pages. Say our current top Id is 100,000,000 and we wanna insert a new row for Id 500 (sketched after this list) – that 8KB page is probably already full. SQL Server has to do a page split to add an additional page to the index, move some of the rows over to that new 8KB page, insert the new row, and update the b-tree pages. You’d think this would be a rare scenario, but…
- This also happens when you need to update a variable-width column (or a nullable column) to have wider contents. Say we have a user whose DisplayName was just “Brent”, and they chose to update it to their full name. There might not be enough empty space on the 8KB page to accommodate that wider data, so SQL Server would have to allocate a new page and do a page split.
- This also happens if we need to add another version of a row. This wasn’t a big deal in the early days of SQL Server because the database engine just didn’t do row versioning for multi-version concurrency control (MVCC). Microsoft didn’t add that (under the name of Read Committed Snapshot Isolation, aka RCSI) until 2005, and even today, it isn’t the default for new database creations. Database developers have to enable MVCC for each database they create – most don’t, but the ones who do experience issues with page splits and fragmentation.
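Here’s a sketch of the first two scenarios against the table above (the values are made up; SET IDENTITY_INSERT is needed only because Id is an identity column):

-- Scenario 1: inserting "in the middle" of the clustered index. The 8KB page
-- that should hold Id 500 is probably already full, forcing a page split.
SET IDENTITY_INSERT Users ON;
INSERT INTO Users (Id, DisplayName, Location, Reputation)
VALUES (500, 'Backfilled User', 'Parts Unknown', 1);
SET IDENTITY_INSERT Users OFF;

-- Scenario 2: widening a variable-length column. If the page has no free
-- space left for the longer value, that's a page split too.
UPDATE Users
SET DisplayName = 'Brent Ozar'
WHERE DisplayName = 'Brent';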
To deal with this, especially in the days of rusty spinning frisbees, Microsoft database administrators became intimately familiar with index rebuilds and fill factor. These days, the topic of defragmentation is rather hotly contested amongst DBAs.
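A rough sketch of that maintenance routine – the 80 percent fill factor here is just an illustrative number, not a recommendation:

-- How fragmented has the clustered index become?
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Users'), NULL, NULL, 'LIMITED');

-- Rebuild it, leaving 20% of each page empty to absorb future page splits:
ALTER INDEX ALL ON Users REBUILD WITH (FILLFACTOR = 80);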
Not all tables used clustered indexes. Typical exceptions included:
- Staging tables – in data warehouses, it’s common to pull data from a source nightly, load it into a staging table, do some processing on it, load it into its final destination, and then delete/truncate it (there’s a quick sketch after this list). In that case, a heap made sense because any query that worked on the staging table accessed all of the rows, not needing to dive into any particular row.
- Data warehouse fact tables – Microsoft had a reference architecture called the Fast Track Data Warehouse Reference Architecture in which all fact tables were heaps, and query performance just came down to how fast you could scan the entire fact table. Those fact tables didn’t have any indexes on them at all because we could never predict what data warehouse users would search for, or what order they’d need the data in. Later on, as Microsoft’s columnstore implementation improved, data warehouses began to migrate over to clustered columnstore instead.
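A minimal sketch of that staging-table pattern – the table and source names here are made up for illustration:

-- A heap: no clustered index at all, so rows just get appended wherever there's room.
CREATE TABLE Staging_Users(
    Id INT,
    DisplayName VARCHAR(40),
    Location VARCHAR(100),
    Reputation INT);

-- Nightly: load it, transform it, move it to the final destination, empty it out.
INSERT INTO Staging_Users (Id, DisplayName, Location, Reputation)
SELECT Id, DisplayName, Location, Reputation FROM SourceSystem.dbo.Users;

INSERT INTO Users (DisplayName, Location, Reputation)
SELECT DisplayName, Location, Reputation FROM Staging_Users;

TRUNCATE TABLE Staging_Users;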
Meanwhile, in the Postgres world…
Postgres starts with a very different assumption: it uses multi-version concurrency control (MVCC, or what SQL Server folks call RCSI) by default, right from the get-go. When we update rows, Postgres automatically makes new versions of the rows, storing them inside the same table. There’s no need to enable it – it’s on for all databases, by default.
So because Postgres is constantly creating new versions of rows, it doesn’t really make sense to store the table itself in some kind of order.
Every table is a heap.
You can create indexes, primary keys, and foreign keys, but there’s no such thing as a clustered index. It’s just not an option in Postgres.
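Here’s a rough Postgres equivalent of the earlier table – a sketch, not the real Stack Overflow schema – showing that the primary key is just a separate b-tree index while the table itself stays a heap, and that an update quietly leaves a new row version behind:

CREATE TABLE users (
    id           int GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    display_name varchar(40),
    location     varchar(100),
    reputation   int
);
-- The PRIMARY KEY builds a separate b-tree index on id; the table's rows
-- live in a heap, in no particular order.

-- ctid shows where a row version physically lives (page, item) in the heap:
SELECT ctid, xmin, display_name FROM users WHERE id = 1;

UPDATE users SET display_name = 'Brent Ozar' WHERE id = 1;

-- The update wrote a brand-new row version (note the new ctid), leaving the
-- old version behind as a dead tuple for VACUUM to clean up later:
SELECT ctid, xmin, display_name FROM users WHERE id = 1;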
However, you can cluster a table.
But that statement is really deceptive. The CLUSTER command’s syntax looks like this:
CLUSTER users USING ix_users_id;
Here, ix_users_id is an index you created on the users.id column. That command rewrites the users heap with its rows sorted in the same order as the ix_users_id index. However, it’s a one-time operation, done at the moment you execute the CLUSTER command – the heap’s order is not maintained going forward.
You can rerun CLUSTER users later, without specifying the index name again, and it’ll automatically re-sort the data using the same index it used last time.
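If you’re curious which index Postgres remembers for that, it’s tracked in pg_index – a quick sketch:

-- Re-cluster using whatever index the table was clustered on last time:
CLUSTER users;

-- See which index (if any) the table is currently marked as clustered on:
SELECT indexrelid::regclass AS clustered_index
FROM pg_index
WHERE indrelid = 'users'::regclass
  AND indisclustered;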
You can kinda think of that as index maintenance that a database administrator would do, but it’s not really the main index maintenance task we have to worry about in Postgres. No, Postgres has its own interesting index maintenance challenge, and we’ll leave that for the next Postgres Surprise post.

12 Comments
Interesting to know that all tables are heaps in postgres. So does that mean they (the rows of the data) get stored to disc in a random order? And to ask another question, when you say “storing new versions” of rows, is that to imply that there could be different versions of rows existing at a time? Sorry the phrasing made me curious. I guess I could google this. Very informative, thanks.
They copied Oracle DB’s design. Oracle DB is also a heap-based database with an unordered table design. But unlike PostgreSQL, Oracle has IOTs (Index-Organized Tables), a special-purpose table type which is kind of similar to SQL Server’s clustered index (not 100% identical).
Yes, different versions of a row can exist in the data block for some time, before a garbage-cleaning process called VACUUM clears out dead tuples [PG speak for rows]. However, from the application’s point of view, you only ever see the row version your current transaction is entitled to view, with no extra coding. There is a PG extension which can show dead tuples.
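(For anyone following along: the built-in statistics views give an approximate dead-tuple count, and one such extension is pgstattuple, which reports exact numbers – a quick sketch, assuming a users table:)

-- Approximate dead-tuple counts from the statistics collector:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Exact numbers via the pgstattuple extension:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT dead_tuple_count, dead_tuple_percent
FROM pgstattuple('users');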
And this is how you end up in vacuum hell…
Vacuum served its purpose, but with modern SSDs it is more of a liability. PG is moving towards a new storage system which will closely mimic Oracle’s UNDO tablespace design and stop storing old tuple versions in the data block itself.
Hmm, can you point me to more reading material on this new storage system? I haven’t heard anything about this.
https://www.orioledb.com/docs
That’s an extension, not part of core. There are already lots of extensions for Postgres, but that doesn’t mean the database engine itself is headed in any one of those particular directions.
The core-engine part is the zheap storage engine.
Google “postgres zheap”.
However, it is moving very slowly and there is no target date.
I think “slow” is being a little generous – the GitHub project hasn’t seen activity in half a decade (https://github.com/EnterpriseDB/zheap). That project was abandoned.