Tuesday, July 17

PG Phriday: Postgres on ZFS

ZFS is a filesystem originally created by Sun Microsystems, and it has been available on BSD for over a decade. While Postgres runs just fine on BSD, most Postgres installations historically run on Linux. ZFS on Linux has had a much rockier road to integration due to perceived license incompatibilities.

As a consequence, administrators were reluctant or outright refused to run ZFS on their Linux clusters. It wasn’t until OpenZFS was introduced in 2013 that this slowly began to change. These days, ZFS and Linux are starting to become more integrated, and Canonical of Ubuntu fame even announced direct support for ZFS in their 16.04 LTS release.

So how can a relatively obscure filesystem designed by a now-defunct hardware and software company help Postgres? Let’s find out!

Eddie waited til he finished high school

Old server hardware is dirt cheap these days, and it makes for a perfect lab for testing suspicious configurations. For those following along at home, or who just want a point of reference, this is the server we’ll be using for these tests:

  • Dell R710
  • x2 Intel X5660 CPUs, for up to 24 threads
  • 64GB RAM
  • x4 1TB 7200RPM SATA HDDs
  • H200 RAID card configured for Host Bus Adapter (HBA) mode
  • 250GB Samsung 850 EVO SSD

The H200 is particularly important, as ZFS acts as its own RAID system. It also has its own checksumming and other algorithms that don’t like RAID cards getting in the way. As such, we put the card itself in a mode that facilitates this use case.

Due to that, we lose out on any battery-backed write cache the RAID card might offer. To make up for it, it’s fairly common to use an SSD or other persistent fast storage to act as both a write cache and a read cache. This also transforms our HDDs into hybrid storage automatically, which is a huge performance boost on a budget.

She had a guitar and she taught him some chords

First things first: we need a filesystem. This hardware has four 1TB HDDs, and a 250GB SSD. To keep this article from being too long, we’ve already placed GPT partition tables on all the HDDs, and split the SSD into 50GB for the OS, 32GB for the write cache, and 150GB for the read cache. A more robust setup would probably use separate SSDs or a mirrored pair for these, but labs are fair game.
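For the curious, that prep work amounts to a few sgdisk invocations. This is a sketch only: the device names match this particular lab, and the OS and boot partitions on the SSD (assumed here to occupy partitions 1–3) are left to the installer:

```shell
# One whole-disk partition on each HDD for the pool
for disk in sdb sdc sdd sde; do
    sgdisk --zap-all /dev/$disk      # wipe any stale partition tables
    sgdisk -n 1:0:0 /dev/$disk       # partition 1 spans the disk
done

# Carve the SSD: 32GB for the write log, the remainder for the read cache
sgdisk -n 4:0:+32G /dev/sda
sgdisk -n 5:0:0    /dev/sda
```

The partition numbers here line up with the sda4 and sda5 devices used when building the pool below.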

We begin by creating the ZFS pool itself and showing the status so we know it worked:

$> zpool create tank -o ashift=12 \
    mirror sdb1 sdd1 \
    mirror sdc1 sde1 \
       log sda4 \
     cache sda5
 
$> zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Sun Dec 10 00:24:02 2017
config:
 
    NAME                            STATE     READ WRITE CKSUM
    tank                            ONLINE       0     0     0
      mirror-0                      ONLINE       0     0     0
        sdb1                        ONLINE       0     0     0
        sdd1                        ONLINE       0     0     0
      mirror-1                      ONLINE       0     0     0
        sdc1                        ONLINE       0     0     0
        sde1                        ONLINE       0     0     0
    logs
      sda4                          ONLINE       0     0     0
    cache
      sda5                          ONLINE       0     0     0
 
errors: No known data errors

That first zpool command created the equivalent of a 4-disk RAID-10 with separate read and write cache. Besides being disturbingly easy, it also mounted the filesystem automatically. We could literally start storing data on it immediately, but let’s take some time to tweak it first.
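One flag from that zpool create deserves a closer look. The ashift property is the base-2 logarithm of the sector size ZFS assumes, and it’s fixed for the life of the pool, so it has to be right at creation time. Setting ashift=12 pins the pool to 4kb sectors, which matches modern drives even when they advertise 512-byte logical sectors:

```shell
# ashift=12 means 2^12-byte sectors:
echo $((2**12))    # prints 4096
```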

Compression isn’t always enabled by default, and it’s common to disable atime so every read doesn’t incur a corresponding disk write. We should probably also create a separate area for the database storage to keep things tidy.

# lz4 compression is nearly free in CPU terms, so leave it on
zfs set compression=lz4 tank

# Skip the access-time write on every read; relatime only applies
# if atime is ever switched back on
zfs set atime=off tank
zfs set relatime=on tank

# A dedicated dataset keeps database properties and snapshots tidy
zfs create tank/db

This test is also taking place on a Debian system, so we have access to several cluster creation and configuration utilities. Let’s create a dedicated ZFS instance and set some common configuration parameters:

pg_createcluster 10 zfs -D /tank/db
systemctl daemon-reload
 
pg_conftool 10 zfs set shared_buffers 4GB
pg_conftool 10 zfs set work_mem 12MB
pg_conftool 10 zfs set maintenance_work_mem 1GB
pg_conftool 10 zfs set random_page_cost 2.0
pg_conftool 10 zfs set effective_cache_size 40GB
 
systemctl start postgresql@10-zfs
 
createdb pgbench
pgbench -i -s 100 pgbench

At the end we also created a pgbench database and initialized it with 10M records for testing purposes. Let’s get started!
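The -s 100 scale factor is where that 10M figure comes from: pgbench creates 100,000 pgbench_accounts rows per scale unit:

```shell
# Rows in pgbench_accounts at scale factor 100:
echo $((100 * 100000))    # prints 10000000
```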

They moved into a place they both could afford

Let’s start by getting a performance baseline for the hardware. We might expect peak performance at 12 or 24 threads because the server has 12 real CPUs and 24 threads, but query throughput actually topped out at 32 concurrent processes. We can scratch our heads over this later; for now, we can consider it the maximum capability of this hardware.

Here’s a small sample:

$> pgbench -S -j 32 -c 32 -M prepared -T 20 pgbench
 
...
tps = 264661.135288 (including connections establishing)
tps = 264849.345595 (excluding connections establishing)

So far, this is pretty standard behavior. 260k prepared queries per second is great read performance, but this is supposed to be a filesystem demonstration. Let’s get ZFS involved.

The papers said Ed always played from the heart

Let’s repeat that same test with writes enabled. Once that happens, filesystem syncs, dirty pages, WAL overhead, and other things should drastically reduce overall throughput. That’s an expected result, but how much are we looking at, here?

$> pgbench -j 32 -c 32 -M prepared -T 10 pgbench
 
...
tps = 6153.877658 (including connections establishing)
tps = 6162.392166 (excluding connections establishing)

SSD cache or not, storage overhead is a painful reality. Still, 6000 TPS with writes enabled is a great result for this hardware. Or is it? Can we actually do better?

Consider the Postgres full_page_writes parameter. Tomas Vondra has written about it in the past as a necessity for preventing page corruption due to partial writes. The WAL drives both streaming replication and crash recovery, so its integrity is of utmost importance. As a result, this is one parameter almost everyone should leave alone.

ZFS is Copy on Write (CoW). As a result, a torn page isn’t possible: a block is either written in full, or the previous copy remains intact. This means we can safely turn off full_page_writes in the Postgres config. The results are some fairly startling performance gains:

$> pgbench -j 32 -c 32 -M prepared -T 10 pgbench
 
...
tps = 10325.200812 (including connections establishing)
tps = 10336.807218 (excluding connections establishing)

That’s nearly a 70% improvement. Due to write amplification caused by full page writes, Postgres produced 1.2GB of WAL files during a 1-minute pgbench test, but only 160MB with full page writes disabled.
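For the record, the change itself is a single setting, applied here with the same pg_conftool convention used earlier. Only do this on a CoW filesystem like ZFS; on ext4, XFS, and friends it invites torn-page corruption:

```shell
pg_conftool 10 zfs set full_page_writes off
systemctl restart postgresql@10-zfs
```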

To be fair, a 32-thread pgbench write test is extremely abusive and certainly not a typical usage scenario. However, by letting us alter one single parameter, ZFS just ensured our storage sees a much lower write load. That means the hardware can also absorb higher write workloads, since IO bandwidth is no longer being consumed by extra WAL traffic.

They both met movie stars, partied and mingled

Astute readers may have noticed we didn’t change the default ZFS block size from 128k to align with the Postgres default of 8kb. This is what happens when we do that:

$> zfs get compressratio tank/db
 
NAME     PROPERTY       VALUE  SOURCE
tank/db  compressratio  1.71x  -
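The command behind that 8kb experiment isn’t shown above; as a sketch, shrinking the record size is a single property change. Note it only applies to newly written data, so a fair test means reloading the dataset afterward:

```shell
# Align ZFS records with Postgres 8kb pages (affects new writes only)
zfs set recordsize=8k tank/db
```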

Now let’s compare that to the ZFS default of 128kb:

$> zfs get compressratio tank/db
 
NAME     PROPERTY       VALUE  SOURCE
tank/db  compressratio  7.59x  -

As it turns out, the 128kb blocks allow ZFS to better combine some of those 8kb Postgres pages to save space. That will allow our measly 2TB to go a lot further than is otherwise possible.
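Some back-of-the-envelope arithmetic shows what “a lot further” means, assuming the ratio held across a full pool (it won’t exactly, but it’s illustrative): the two mirrors give roughly 2TB usable, and a 7.59x ratio stretches that to around 15TB of logical data:

```shell
# ~2TB usable * 7.59x compression, truncated to whole terabytes:
echo "$((2 * 759 / 100)) TB"    # prints "15 TB"
```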

Please note that this is not de-duplication, but simple lz4 compression, which is nearly real-time in terms of CPU overhead. De-duplication on ZFS is currently an uncertain bizarro universe populated with misshapen horrors crawling along a broken landscape. It’s a world of extreme memory overhead for de-duplication tables, and potential data loss due to inherent conflicts with the CoW underpinnings. Please don’t use it, don’t let anyone else use it, and don’t even think about using it, ever.

They made a record and it went in the chart

We’re still not done. One important aspect of ZFS as a CoW filesystem is that it has integrated snapshots.

Consider the scenario where a dev is connected to the wrong system and drops what they think is a table in a QA environment. It turns out they were in the wrong terminal and just erased a critical production table, and now everyone is frantic.

We can simulate that by dropping the table ourselves. For the purposes of this demonstration, we’ll take a snapshot before doing so:

zfs snapshot tank/db@saveme
psql pgbench -c "drop table pgbench_accounts"

Agh! Luckily, ZFS has a clone ability that makes full use of its CoW nature. While ZFS snapshots are always available in a read-only manner, a clone is a fully writable fork of that snapshot. This means we can make a few slight changes and bring the clone online with the same data as when we took the snapshot.

Namely, we may need to copy and modify configuration files and remove any lingering postmaster.* files that may conflict with the running instance. Then we can just start the clone on a different port. At that point, there are two Postgres instances, and we can dump the missing pgbench_accounts table from the clone and import it back into the main database without skipping a beat.

Here’s how that would work:

zfs clone tank/db@saveme tank/recover
 
cd /tank/recover
rm postmaster.*
 
cp /usr/share/postgresql/10/postgresql.conf.sample postgresql.conf
cp /etc/postgresql/10/zfs/pg_hba.conf pg_hba.conf
 
pg_ctl -D /tank/recover -o "-p 5433" start
 
pg_dump -p 5433 -t pgbench_accounts pgbench | psql pgbench

With the table restored, everyone was happy, and the infrastructure team put in new firewall rules that prevented development systems from connecting to prod. All learned a valuable lesson and moved on with life.

Afterwards, we just need to clean up by stopping the clone Postgres instance and erasing the associated mount:

pg_ctl -D /tank/recover stop -m immediate
zfs destroy tank/recover

Many ZFS systems have automated snapshot creation and cleanup maintenance jobs running most of the time. This means we may already have an hourly, daily, or weekly snapshot available for recovery or investigation purposes.
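None of that tooling is shown here, and packaged utilities like zfs-auto-snapshot are more robust, but a bare-bones daily rotation can be as small as a cron-driven script like this sketch. The dataset name, naming scheme, and 7-day retention are all assumptions, and the date arithmetic assumes GNU coreutils:

```shell
#!/bin/sh
# Take a dated snapshot of the database dataset and prune old ones.
DATASET=tank/db
KEEP_DAYS=7

zfs snapshot "${DATASET}@daily-$(date +%Y%m%d)"

# Destroy daily snapshots older than the retention window.
cutoff=$(date -d "-${KEEP_DAYS} days" +%Y%m%d)
for snap in $(zfs list -H -t snapshot -o name -r "$DATASET" | grep '@daily-'); do
    stamp=${snap##*@daily-}
    [ "$stamp" -lt "$cutoff" ] && zfs destroy "$snap"
done
```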

While the Linux Volume Manager (LVM) system has similar functionality, LVM snapshots are limited to non-allocated space in the volume group, and must be mounted manually. ZFS snapshots come from the same storage pool, so we don’t need to guess at how large future snapshots may need to be. We don’t lose that potential allocation space from our volume. Clones are also mounted automatically in a way that shows their relation to the source.

ZFS makes it extremely easy to leverage snapshots. In fact, a development system could even designate a single Postgres instance as a snapshot source, and provide dozens of developers their own online copy of the database. It is, after all, the same approach used for VMs running on ZFS.

ZFS also has a command that can send a snapshot to a remote ZFS filesystem. If a previous snapshot already exists there, it only sends the differences. This built-in differential is actually much faster than even rsync, as it already knows exactly which blocks are different. Consider what this means for VLDB Postgres systems that are several TB in size.
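The commands in question are zfs send and zfs receive. A sketch of the workflow, with the host and snapshot names invented for illustration:

```shell
# Seed the standby with a full copy of one snapshot:
zfs send tank/db@sunday | ssh standby zfs recv tank/db

# From then on, ship only the blocks that changed between snapshots:
zfs send -i tank/db@sunday tank/db@monday | ssh standby zfs recv tank/db
```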

The future was wide open

It’s difficult to discount an immediately observable reduction in write overhead. Snapshots have a multitude of accepted and potential use cases, as well. In addition to online low-overhead compression, and the hybrid cache layer, ZFS boasts a plethora of features we didn’t explore.

Built-in checksums with integrated self-healing suggest it isn’t entirely necessary to re-initialize an existing Postgres instance to enable checksums. The filesystem itself ensures checksums are validated and correct, especially if we have more than one drive resource in our pool. It even goes the extra mile and actively corrects inconsistencies when encountered.
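That verification isn’t only passive, either. A scrub walks every allocated block, checks it against its checksum, and rewrites any bad copies from a healthy mirror; the zpool status output earlier already showed the result of one completed pass:

```shell
zpool scrub tank     # start a full background verification pass
zpool status tank    # shows scrub progress and any repaired errors
```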

I immediately discounted ZFS back in 2012 because the company I worked for at the time was a pure Linux shop. ZFS was only available using the FUSE driver back then, meaning ZFS only worked through userspace with no real kernel integration. It was fun to tinker with, but nobody sane would use that on a production server of any description.

Things have changed quite drastically since then. I’ve stopped waiting for btrfs to become viable, and ZFS has probably taken the throne away from XFS as my filesystem of choice. Future editions of the Postgres High Availability Cookbook will reflect this as well.

Postgres MVCC and ZFS CoW seem made for each other. I’m curious to see what will transpire over the next few years now that ZFS has reached mainstream acceptance in at least one major Linux distribution.

15 Comments

  • Craig

    I love PostgreSQL on ZFS. I’ve been doing it for a while, first on FreeBSD, now on Ubuntu. I can’t go back to any other FS now.

    Some things I’ve learned:
    * Try building your zpool with a RAIDZ instead of RAID10. You should get better write performance, even with the SSD write log. As always, run your own benchmarks to make sure.
    * If you do this in production, make sure the log SSD has enough power to complete writing during a power outage. You don’t want to lose that device. I use Intel DC series SSDs.
    * ZFS Snapshots are great for setting up Streaming Replication.
    * ZFS Snapshots are great for running pg_upgrade –link

    • Shaun Thomas

      All excellent points. Snapshots have a lot more potential than I explored here, so it’s good to see more from people using it in the field.

  • Miha

We have experience running many VMs on RAIDZ, and I wouldn’t recommend it. The thing is slow as hell when the VMs hit the disks hard. Also, the ZIL device didn’t help a bit; we got much better performance with it turned off.

  • TJ

Great article! I’ve been a big fan of ZFS for a long time. It’s a shame you haven’t done any benchmarks with ext3 and LVM on the same hardware for comparison.

    • Shaun Thomas

I’d have to tear apart everything and re-run if I wanted to do that. And I’d likely go with EXT4 instead of EXT3. Regardless, EXT4 is iffy with filesystems over 16TB, and I’d be terrified to ever fsck such a monstrosity in any case. In reality, it’s just a 4-disk RAID-10 with 7200 RPM drives. There’s a theoretical maximum that kind of hardware can deliver, and ZFS obliterated it thanks to the ZIL SSD passthrough. In theory, LVM has an SSD cache layer as well, but the last time I tried to use it, things didn’t go well. Granted, that was three or four years ago, so it’s probably worth revisiting, just like I did with ZFS.

      However, even assuming LVM+EXT4 works properly now, you still don’t get checksums or in-line compression, and you have to juggle the stack of individual parts. ZFS provides everything as a packaged deal. I’ve come to accept it’s more capable than any other Linux filesystem stack short of CEPH or GlusterFS.

  • Andrew Dunstan

Years ago I used compressed ZFS as the backing store for Barman. It was far more efficient than the user-level compression, and we got both more space to use and a very significant reduction in backup time due to the reduced write volume.

    • Eugene Vilensky

Thank you for this post. Do you have any intent to get the H200 out of the way completely, and maybe dedicate the SSD as a ZFS Intent Log device?

      • Shaun Thomas

        So far as I know, there’s no “getting the H200 out of the way”. The on-board controller is only 4-ports, and they’re only SATA-3. In fact, I should probably put the SSD on the H200 as well, because it is theoretically being throttled by the on-board SATA.

        So far as the ZIL, only a portion of the SSD is necessary. It is literally impossible to saturate the full SSD with this hardware using any kind of workload I can generate by engaging all of the CPUs. In a real setup, I’d probably get two small SSDs and create a ZIL mirror between them. During an extended test, I monitored the throughput of the ZIL, and even though it’s 32GB, it never got anywhere near even a fraction of being full. It was however, saturated at 99% utilization while flushing to the HDDs. I suspect it’s because it’s on the SATA-3 channel while the HDDs are on the SATA-6.

        Here’s the best part not in the article, though… I temporarily disabled the ZIL partition and re-ran the tests. Instead of 10k write TPS, it was closer to 1200. Not terrible for a traditional 4-disk RAID-10 if we’re being honest, and probably what we’re likely to get from EXT4. But absolutely horrendous by comparison now that we’ve seen what ZFS is capable of.

  • Michael Schout

    Great article.

Have you seen the OpenZFS Postgres tuning recommendations? They recommend using separate datasets for data and WAL, and on the data volume, using -o recordsize=8k to match the database page size. Granted, lz4 compression makes this less useful, but 8k still seems like a good limit: the default is 128k, so each record write would involve rewriting a whole 128k record rather than 8k. With lz4 compression enabled, the records will still be at most 8k.

They also recommend logbias=throughput on the data volume to avoid writing twice. I’d be curious to know if you benchmarked with either or both of these. Also, it’s unclear to me if it’s still safe to disable full_page_writes when using logbias=throughput. I’m not a ZFS expert by any means.

  • Phil Krylov

    Great post. However, you don’t ever mention the lack of SSD TRIM/discard support in the latest stable (and even unstable) ZoL releases. Don’t you see this as a problem?

    • Shaun Thomas

      Possibly. I’ve been told that modern drives don’t need TRIM as much as older ones. In this case as there are no files being saved long-term, it’s probably less relevant. It didn’t seem to affect my testing, but I also didn’t check for that. My write cache is only 32GB, so it’s probably not too hard to get it to a point where not having TRIM could hurt performance. I’ll add that to my list of benchmarks to run later.

    • Shaun Thomas

      Right. My homelab bench system is specifically for screwing around. In a real production server, you must use SSDs with integrated super-capacitors. I have my server on a UPS for blips and brownouts, but there’s no substitute for real loss protection.
