Download the Stack Overflow Sample Database for Postgres

For my Query Exercises, I like using the Stack Overflow data dump because it’s interesting data, real-world-sized, and has real-world challenges with data distribution and slow queries. It’s easy to understand because it has just a handful of tables, and their relationships are straightforward. Plus, it’s openly licensed and no charge for you – just pick a size:

  • Small – 6GB, available via torrent or magnet. A 1GB backup file that restores to a ~6GB database with data from the first 3 years of Stack Overflow’s existence. If you just want a quick, easy, friendly database for demos and for following along with the Query Exercises here, this is probably the one to grab.
  • Large – 117GB, available via torrent or magnet. A 34GB backup file that restores to a ~117GB database with data up to December 2023.

After you download either one, log into your own Postgres server, create a stackoverflow database, and use pg_restore to import the dump file into your newly created database.
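For example, from a shell – a minimal sketch, where the user name and dump file name are placeholders you’ll substitute with your own:

    # Create the target database, then restore the dump into it.
    # --no-owner and --no-acl skip ownership/permission statements that
    # reference roles your server may not have; --jobs parallelizes the restore.
    createdb -U postgres stackoverflow
    pg_restore -U postgres --dbname=stackoverflow --no-owner --no-acl --jobs=4 stackoverflow.dump

The --jobs flag only works with custom- or directory-format dumps, but it can cut restore times considerably on the Large version.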

If you’re using DBeaver, and you created your database connection manually, the newly created database won’t show up until you edit your connection and check the box for “Show all databases”.

About the Database Contents

The data comes from the Stack Overflow data dump, which is licensed under CC BY-SA 4.0. When you use it, you have to attribute the data as specified in the license, making it clear where you got the data from. (You didn’t get the data from me – you got it from the kind folks at Stack Overflow. You don’t have to call me out by name, since my part of the work is relatively tiny.)

The tables are fairly straightforward. Here are the most commonly queried ones:

  • users (21M rows) – everyone who’s ever created an account at Stack Overflow to post a question, answer, comment, etc. One of the neat things about StackOverflow.com is the URL structure: if you’re in the database looking at users.id = 26837, you can go to https://stackoverflow.com/users/26837, and presto, you see the web user interface for that particular row. The profileimageurl and emailhash columns are intentionally left null.
  • posts (60M rows) – questions and answers, as indicated by their posttypeid. There are other kinds of posts as well, but those are left as an exercise for the reader.
    • For answers, you can join to the question by looking at its posts.parentid – that’ll correspond to the question’s posts.id.
    • For questions, you can tell which answer was accepted because it’s stored in posts.acceptedanswerid, which will join to posts.id.
    • The user who asked the question (or posted the answer) is stored in posts.owneruserid. Generally, all of the relationships in the database are similarly labeled and easy to join on, although the database doesn’t have any foreign keys. (There’s a sample query after this list.)
  • comments (90M rows) – both questions & answers can have comments attached to ’em. (To visualize this, check out a question that has comments, plus the answers have comments.)
  • votes (236M rows) – when you click the up or down arrow on a question or answer at Stack Overflow, you’re casting a vote. This table has a votetypeid column that joins to the votetypes table to tell you what kind of vote the row represents, like an upvote or downvote (or other). In the public data dump, they’ve sanitized this data a little so you can’t go figure out who’s been stalking your questions and downvoting them all.
  • badges (50M rows) – users can earn badges like ‘Informed’ or ‘Nice Answer’, and they get stored in this table. You’ll note that it isn’t normalized: the actual name of the badge is stored right in this table. That’s one of the things I like about this database – it’s useful for teaching both good habits and bad. (And of course, the current live Stack Overflow database no longer matches the data dump format – the dump format has stayed the same for over a decade, and it’s maintained that way for backwards compatibility.)
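To make those relationships concrete, here’s a minimal sample query joining questions to their accepted answers, and to the users who wrote those answers. (It assumes the lowercase column names described above, plus title and displayname columns that I believe exist in the dump; in the Stack Overflow data, posttypeid 1 is a question and 2 is an answer.)

    -- Questions with their accepted answers, and who wrote each answer.
    SELECT q.id AS question_id,
           q.title,
           a.id AS answer_id,
           u.displayname AS answered_by
    FROM posts q
    JOIN posts a ON a.id = q.acceptedanswerid  -- the question's accepted answer
    JOIN users u ON u.id = a.owneruserid       -- the user who posted that answer
    WHERE q.posttypeid = 1                     -- 1 = question, 2 = answer
    LIMIT 10;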

You can also learn the database structure by looking at user queries on Data.StackExchange.com. That’s like a web-based version of SQL Server Management Studio where you can run queries against a nearly-live copy of the Stack Overflow database, typically restored every weekend. That data lives in Microsoft SQL Server, and there are more tables there than in this data dump, but the publicly shared queries may still be useful for learning the schema and relationships.

Known Issues

This is the first Postgres version I’ve published, and I wanted to get it out there to start experimenting and get feedback. There are a few known issues:

  • The restore files have ‘.sql’ extensions when they’re really dump files. (DBeaver seems to override the file extension when it creates the dump in the GUI.)
  • The LinkTypes and PostTypes tables are not included.
  • Your restore may get errors about not being able to grant permissions to a readaccess role. That’s okay – the backup was sourced from AWS Aurora Postgres, which has different default roles than your server. (See the workaround after this list.)
  • The small version’s row counts and data aren’t exactly the same as the SQL Server version of the Stack Overflow 2010 database that I’ve published at BrentOzar.com. This is the one known issue I don’t intend to fix. The small databases will never be in sync, and that’s okay – these databases aren’t for database platform comparison or benchmarking purposes. If you want to make multiple identical copies of the database for different platforms, your best bet will be to take the data dump files and create what you’re looking for.
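For that last permissions issue, if you’d rather not see the errors at all, one workaround is to create the role the dump expects before restoring – a minimal sketch:

    -- Create the role that the dump's GRANT statements reference,
    -- so the restore's permission changes succeed instead of erroring.
    CREATE ROLE readaccess;

Alternatively, pg_restore’s --no-acl flag skips the permission statements altogether.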

Want to Do Easy Inserts?

Unfortunately, I didn’t set the id columns to be sequences or serials, so if you want to insert rows without specifying the id, you’ll need to back the id column with a sequence.
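Here’s a minimal sketch for the users table (the sequence name is arbitrary); the same pattern works for any table you want to insert into:

    -- Create a sequence, advance it past the current maximum id,
    -- and make it the default for future inserts.
    CREATE SEQUENCE users_id_seq OWNED BY users.id;
    SELECT setval('users_id_seq', (SELECT max(id) FROM users));
    ALTER TABLE users ALTER COLUMN id SET DEFAULT nextval('users_id_seq');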

Sorry about that! I’ll fix that in the next version of the download.

Got Questions or Feedback?

I’ve been distributing the Microsoft SQL Server version for almost a decade, but this is my first time distributing a sample database for Postgres, so I know I’ve got a lot to learn. The torrents are labeled v0.1 because I see this as an early access release. I’m very open to feedback on what I should do differently.

  • Backup configuration settings or contents – other than the wrong dump file extension, is there anything I should change about the backup itself?
  • Indexes – I’m including several indexes. Even with them, the small database stays small, and they make it easier for beginners to follow along with the query exercises here on the site. For the final Large version, though, I think I’ll distribute it without indexes just to make the restore times faster, and include a script to recreate the indexes to match the ones I commonly use in training.
  • Datatypes – because the main purpose of this database is for learning, I’m open to purposely using ‘bad’ datatypes for specific tables/columns if it’ll help teach concepts better, or mismatching datatypes between different tables/columns. If you’d like to change some of the datatypes, tell me which tables/columns, what you’d like to change, and why.
  • Download placement – I plan on only distributing it via BitTorrent to keep my costs low. I know that’s painful for some folks, and I’d love to do a direct download option, but it would need to be free (or nearly free). If you know of a free site that allows for unlimited downloads of a 34GB file, or if you’ve got a hook-up at an organization that can take care of that, I’d love to work with you on it.

Building Your Own Database

If you want to create indexes or test queries that change data, you’ll want to build your own Stack Overflow database. To do that, check out Francesco Tisiot’s tutorial on loading the Stack Overflow data. The exact queries to do it are at the bottom of his post.
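As a rough sketch of that staging approach (the table name is a placeholder, and this assumes the XML has already been cleaned of tabs and backslashes, which his post covers):

    -- Stage each <row .../> line of Users.xml as raw text, then parse it.
    CREATE TABLE users_staging (row_data text);

    -- In psql: load just the data lines, skipping the XML wrapper element.
    \copy users_staging FROM PROGRAM 'grep "<row" Users.xml'

    -- Pull attributes out of each staged row with xpath().
    SELECT (xpath('//row/@Id', row_data::xml))[1]::text::int AS id,
           (xpath('//row/@DisplayName', row_data::xml))[1]::text AS displayname
    FROM users_staging
    LIMIT 10;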

That method requires downloading the original XML files. The ones we use on our read-only server are the big ones for StackOverflow.com itself.

Note that those files are big, and when they’re unzipped with 7zip, they’re even bigger – hundreds of gigabytes. You don’t have to use the same Stack Overflow data that we use – you can pick the data for another, smaller site. Each of the smaller sites is distributed as a single 7z file containing all of that site’s data.

All of the sites use the exact same file names & formats – for example, anime.stackexchange.com and stackoverflow.com both have users.xml, posts.xml, comments.xml, etc.

Avoid the meta sites – that’s a different kind of data, discussion about the site itself, that tends to be extremely small.

Learning More About the Data

To learn more about the tables, columns, and their relationships to each other, start with the table descriptions above and the shared queries on Data.StackExchange.com.

The stackoverflow database is imported from the Stack Overflow data dump. This data, like Stack’s data dump, is provided by Stack Overflow under the CC BY-SA 4.0 license. That means you are free to share this database and adapt it for any purpose, even commercially, but you must attribute it to Stack Overflow and the original authors (not Smart Postgres).

15 Comments

  • I was unable to load the Stack Overflow data. I was following Francesco’s blog. When I ran "\copy data_load from program 'tr -d "\t" < Users.xml | sed -e "s/\/\\/g"' HEADER", I got the error "The system cannot find the file specified." Francesco’s blog post has no way to discuss it there, and I have zero prior knowledge of Postgres.

  • Hey Brent, this is my first time creating a database from a torrent. The company that bought me your SQL Server training already had VMs ready to go with the Stack Overflow DBs on them.
    I’m having trouble figuring out how to turn the .torrent file into a .dump file that I can restore into my Postgres server. I haven’t been able to find any blog posts that cover this. If you could point me to any resources, I’d appreciate it.

    All the best!

    • Sure, the .torrent file is just a pointer that BitTorrent apps use to download the full file. It’s like a shortcut. Google or search YouTube for how to use BitTorrent to download something. Cheers!

  • Hi Brent, is there an alternative way to download the torrent or magnet on macOS? I just switched to Mac from Windows and couldn’t find one. I simply click the download link, but I can’t seem to open the downloaded file, which shows up as dmg format.

    • No worries, I was able to figure it out. However, the restore of the 117GB version has been running for almost 6 hours now. Is this expected? I’m from a SQL Server background, and a restore on the machine I have shouldn’t ideally take this much time – unless the PostgreSQL restore, which I’m doing via pgAdmin, just works slowly. I may need to study this more.

      • Hi Rohan. Well, it’s tough for me to troubleshoot performance on your particular laptop – hope that’s fair. You mentioned that you’re on a Mac – hopefully you’re on a relatively current Mac, made in the last couple of years. You may want to start learning about how to check performance on your Mac, like using Activity Monitor, but that’s pretty far beyond the scope of the training I do.

  • I downloaded a client called aria2 that lets me control torrents via CLI, but although the ports are actively listening, it’s not downloading anything.

  • Hi, I’m trying to download the large version of the Stack Overflow database (around 117 GB), but I can’t get it to work.
    I’m new to torrents and not sure how this works. People say that if there are no active seeders, the download just won’t start — and I don’t know what to do about that.
    Is there any other way to download the file, like a direct link or a .zip file instead of using torrent or magnet links?
    I’d really appreciate any help. Thanks.

    • There are active seeders. You may have a firewall blocking BitTorrent.

      Unfortunately there’s no other way of downloading it because the file size is so large. It would literally cost me money each time someone downloaded it. Hope that’s fair.

  • You were right, dear brother, and thank you very much for making me realize the firewall was blocking the download.

