I’m in Vancouver this week for PGconf.dev 2024. Yesterday I blogged my notes from Day 1, and continuing on, here are the sessions I attended on Thursday:
PostgreSQL 17 and Beyond by Amit Kapila
Some of the headlines for 17 (coming later this year) are failover slots so logical replication can continue after a failover, incremental backups, removing the 1GB memory limit on vacuum's dead-tuple storage, JSON_TABLE for better SQL/JSON standard compliance, and SPLIT/MERGE of partitions.
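For a taste of the SQL/JSON work, JSON_TABLE turns a JSON document into rows and columns that can be queried like a regular table. A minimal sketch (the document and column names here are my own invention, not from the talk):

```sql
SELECT jt.*
FROM JSON_TABLE(
  '[{"name": "ant", "legs": 6}, {"name": "spider", "legs": 8}]',
  '$[*]'
  COLUMNS (
    name text PATH '$.name',
    legs int  PATH '$.legs'
  )
) AS jt;
```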
He summarized each feature with a good balance of overview and detail, explaining how the work was done without diving into the exact code. It was a helpful survey for developers who want to stay aware of what else is happening in the engine.
For PostgreSQL 18 and beyond, some of the features under consideration include:
- Transparent column encryption in the client
- Asynchronous IO, like index prefetching
- Import/export of planner statistics, so you could (for example) carry statistics across an upgrade or copy them to a test server without re-running ANALYZE
- Improvements in logical replication (DDL replication, replicating sequences, conflict detection & resolution, all moving towards active/active servers)
- Parallelism – parallel vacuum, correlated subqueries
- Add compression at the wire protocol level
PostgreSQL Memory Management by Krishnakumar “KK” Ravi
KK works for Microsoft, and since they host a lot of Postgres, it’s in their best interest to minimize the hardware requirements. I would also assume other vendors (Amazon, GCP, Neon, etc) would have similar goals.
KK explained that Postgres is process-based. The Postmaster process allocates a bunch of memory to be shared, and then as child processes get forked off, they get access to the same shared pool of memory. A classic example: when one process needs a latch on something, it accesses shared memory. Shared memory is also being used for newer features like parallel hash joins, and that needs to be dynamic to accommodate changing data sizes.
As a relative Postgres n00b, I was surprised to hear about the double buffering problem: when Postgres asks for pages from a data file, those pages get cached both in Postgres’s buffer pool and in the underlying operating system’s file cache. This becomes particularly problematic when you stack multiple Postgres instances onto the same host. That one topic alone was a big ah-ha for me about why KK (and Microsoft) want to improve memory handling. He discussed Palak Chaturvedi’s pg_buffercache_evict work coming in PG17, which is useful for testing for now but could eventually become part of a larger project to manage the buffer pool.
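For the curious, pg_buffercache_evict takes a buffer ID and attempts to evict that buffer from the pool. Combined with the existing pg_buffercache view, a PG17 test might look something like this (the table name my_table is hypothetical, and this requires superuser):

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Try to evict every buffer currently caching the (hypothetical) table my_table
SELECT pg_buffercache_evict(bufferid)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('my_table');
```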
Similarly, I was surprised to hear that Postgres’s shared memory footprint is essentially static: it doesn’t expand or shrink. I can see how that’d be very problematic for hosting providers.
He discussed different possibilities for handling memory going forward, like using a pool of multiple segments, each with different properties. One segment could be removed to free up memory, or additional segments could be added under load. (I like hearing this kind of open discussion, knowing that the community is talking through the pros and cons of different approaches.)
Data Corruption Bugs by Noah Misch
Noah works for Google, and he shared stories of Postgres data corruption experienced by their customers. He starts by using amcheck with the heapallindexed option enabled to detect corruption, then uses pageinspect to examine the data.
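For reference, the amcheck portion of that workflow looks roughly like this – my_index is a placeholder, and heapallindexed makes the check slower but also verifies that every heap tuple has a matching index entry:

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Check a B-tree index's structural invariants, and confirm
-- that every heap tuple is represented in the index
SELECT bt_index_check(index => 'my_index'::regclass, heapallindexed => true);
```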
He explained several corruption bugs he’d caught and patched. My favorite: if you granted permissions on a table while simultaneously creating an index (i1) on it, the index would get created but never receive any subsequent updates. Then, when another index (i2) got created, i1 would come back into use – still missing every change made since it was created. The discovery of that bug led to the discovery of 5 others.
He then talked about how tests could be implemented in the future to catch similar types of bugs. Some of the tests were quite resource-heavy, like in a test suite, doing a backup or an amcheck after every single WAL record or log flush. Attendees chuckled at the thought of a test suite that might take 1-24 hours to run.
The session had a friendly, cooperative vibe with people interested in sharing techniques and fixing the problems long term.
Scaling RDS PostgreSQL by Alisdair Owens and Andrei Dukhounik
Alisdair explained that RDS’s control plane is actually based atop a database in RDS. (Turtles all the way down.) If this control plane database goes down, the rest of RDS won’t be controllable either, so it’s important that RDS be reliable. The primary node in large regions tends to run hundreds of thousands of transactions per second, mostly small writes and lookups.
I liked how they said, “At this scale, tactical optimizations drop in effectiveness over time, and query de-optimizations become more impactful.” What a polite way of saying, “When the server gets this busy, if anything goes wrong, you’re boned.”
Andrei took the stage to cover performance management techniques they use, and then how they optimize the queries that take the most time or execute the most often. (There’s a kernel of a great session topic for general conferences in here: Tune Your Queries Like Amazon Does.)
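They didn’t show their exact tooling, but the standard way to find those queries in Postgres is the pg_stat_statements extension; assuming it’s installed, a typical starting point is:

```sql
-- Top 10 queries by total time spent executing
SELECT query, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```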
They prefer simple queries over complex ones, and in the event of complex requirements, they actually prefer that developers break them up into several simpler queries, run them individually, and combine/join the results on the client side. Even if that technique results in more work for the developers and less efficient apps (due to multiple round trips to the server), it results in more predictable query runtime overall. Similarly, they also avoid query plan hints and custom statistics because they just have too many developers, and not enough performance tuning DBAs to do that kind of work.
Alisdair came back up to talk about performance reliability, and he expressed it in terms of timeouts. Statement and lock timeouts need to be in place on the client side, but more than that, each client needs its own circuit breakers. If a client experiences too much latency or too many errors, the application needs to take itself offline rather than take the shared database server offline.
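On the database side, the statement and lock timeouts he mentioned are ordinary session settings (the values below are illustrative, not Amazon’s); the circuit breakers themselves have to live in the application code:

```sql
-- Abort any statement that runs longer than 5 seconds
SET statement_timeout = '5s';
-- Give up after 1 second spent waiting on a lock, instead of queueing behind it
SET lock_timeout = '1s';
```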
To avoid long locks, some kinds of changes simply aren’t allowed: transactional DDL, changing column types, constraint changes, etc. Deployments have to be written to run at the same time as customers access the system, and have to be retry-able. Deployments are allowed to kill application queries that are interfering with the deployment.
The RDS team’s wish list for operational predictability included user-level resource limits, plan stability, transaction timeouts, bulletproof logical replication, an enforced read-only switch, and ‘safe’ schema changes. (By ‘safe’ schema changes they mean not allowing changes that would cause big blocking chains, or by rolling back queries that are stopping the DDL change.)
Their wishlist for performance included better connection scalability (handling more connections on a server), multi-threaded per-table pg_dump, and incremental materialized view refresh.
Finally, their wish list for operational insight had daily pg_stat_statements partitions, more logging options for operations that generate excess workload, and historical deadlock information. (I had to chuckle when the first attendee question was, “So, on your wish list, you had a lot of hard wishes…”)
Engineering Blog Posts by Claire Giordano
I’ve been blogging for ~25 years, but my techniques could always use updating, so I was curious to hear the advice Claire would give to this particular audience.
Claire suggested that you write for a specific reader – someone you care about, who isn’t an expert yet. Think about where they’re coming from, and have empathy for them. Empathy can mean understanding that people learn in different ways – some people want to see illustrations, other people want to read text. Recognize that they’re busy and in a hurry, so you want to make things easily digestible no matter their learning format. To improve digestibility, use section headlines, bullets, short sentences, short paragraphs, and lots of whitespace.
Embrace the iteration, Claire said: edit your stuff, revise it over time, reconsider it. Andres Freund (yes, that one) emphasized that it’s the same as developing software: you don’t expect to get it right the first time. I suffer on this one a little: I schedule blog posts weeks in advance, but I don’t go back and edit them as often as I should. I get too confident in the first version, hahaha. (And then other times, like this particular post, I just hit publish as soon as the day is over, and off we go.)
For good SEO, Claire suggested writing good content, picking discoverable & clickable titles, avoiding general terms, and sticking to relevant technical terms that your readers use. The author’s bio also needs to show expertise, authoritativeness, and trustworthiness because that can affect SEO as well. Then, distribute & promote your blog where your readers are: Planet Postgres, Reddit, HackerNews, etc.
That’s a wrap for day 2! There was also a later panel session on Making PostgreSQL Hacking More Inclusive, but I had dinner plans with a local friend (Blythe Morrow of Paper Sword B2B Marketing) and I had to bail.
Today gave me the exact same vibes as yesterday: I loved seeing people explain the technical problems they were facing, then solicit ideas and help from others. It was so cool seeing the hallway discussions between people.
Tomorrow’s agenda is an unconference: a participant-driven day where people can propose their own topics, and people can group together for it. I’m curious to see how this goes.