Saturday, November 18

PG Phriday: Getting RAD with Docker [Part 2]

In our last article, we explored how to run Postgres in some very basic Docker scenarios. Based on our experiments back then, we can obtain images, create containers, and mount to various types of storage.


It’s not just boring, it’s mundane. It doesn’t do anything. Sure we can run Postgres in a container, but that’s true about a lot of things. You know what’s exciting? Setting up Postgres streaming replication between two docker containers.

Let’s get started.

He say “I know you, you know me”

The first thing we need to do is create a container that will be our primary host. Postgres requires a user with REPLICATION permission, as well as a special entry in the pg_hba.conf file for the “replication” pseudo-database. We could start a regular Postgres container, connect to it, and set all of that up ourselves, but it’s much cooler to create our own image which does that for us.

The Postgres Docker Hub image has excellent instructions for extending the official image, so let’s start there.

Since we know the the two modifications we need to make, let’s create a basic script to automate the steps:

if [ $(grep -c "replication repuser" ${PGDATA}/pg_hba.conf) -gt 0 ]; then
    exit 0
psql -U "$POSTGRES_USER" \
     -c "CREATE USER repuser WITH REPLICATION" postgres
echo "host replication repuser all trust" >> ${PGDATA}/pg_hba.conf
pg_ctl -D ${PGDATA} reload

Since we could be attaching to a pre-existing Postgres volume, we need to check whether or not we already modified it so the repuser user could connect. Aside from that, all we do is create the user and add a very insecure host entry so it can connect to initiate streaming replication.

He got hair down to his knee

The next step is to create a Dockerfile to define our image. We didn’t do much, so that’s a thankfully easy task. Here’s how that looks:

FROM postgres:latest
COPY scripts/ /docker-entrypoint-initdb.d

We saved our previous script as scripts/ to keep things together. The official Postgres image suggests we can copy that file to a special location and it’ll run automatically. Not only is that convenient, but it keeps us from having to know a lot of Docker syntax at this point.

Here’s the full contents of the folder where we’re constructing our image source:


And now we just create our image:

$> docker build --tag postgres-primary postgres-primary
Sending build context to Docker daemon  5.632kB
Step 1/2 : FROM postgres:latest
 ---> 4860bdf1a517
Step 2/2 : COPY scripts/ /docker-entrypoint-initdb.d
 ---> 03935680f9e2
Successfully built 03935680f9e2
Successfully tagged postgres-primary:latest

Got to be a joker he just do what he please

That’s one image out of the way, but a replication stream requires at least one more component: a replica. To build that, we actually have a slight problem because we need to know the hostname of the container we’re cloning. We can remove that roadblock by requiring users to pass an environment variable with that information.

We also need to stop and erase any existing Postgres instance created by the official image, while also refraining from erasing a legitimate replica. It’s a tall order, but this very basic script has all the requisite parts:

set -e
# This is where we tell the invoker to set PRIMARY_NAME.
if [ "${PRIMARY_NAME}" == "unset" ]; then
  echo "ERROR: Please set PRIMARY_NAME to stream source container."
  exit 1
# Don't erase an existing replica. This means we've already bootstrapped.
if [ -f ${PGDATA}/recovery.conf ]; then
  exit 0
pg_ctl -D ${PGDATA} stop -m fast
rm -Rf ${PGDATA}/*
pg_basebackup -U repuser -X stream -R -c fast \
              -h ${PRIMARY_NAME} -D ${PGDATA}
pg_ctl -D ${PGDATA} start

The “trick” to this script is that we look for a recovery.conf file. That tells us a replica already exists in this instance, and we can skip replacing it with an upstream clone. A more “correct” script might actually check the file to ensure the replica is subscribed to the indicated PRIMARY_NAME, but this is just a demo so it’s OK to be a little sloppy.

He got feet down below his knee

As before, we need another Dockerfile to actually coordinate the moving parts, such as they are. This time we also need to define our environment variable along with copying the script.

FROM postgres:latest
COPY scripts/ /docker-entrypoint-initdb.d

The ENV command requires we set the variable to something, so we just use ‘unset’. Beyond that, there’s really not much else to do. All the heavy lifting is still in the official Postgres image.

Here’s the full contents of the folder where we’re constructing our image source:


And now we just create our image:

$> docker build --tag postgres-replica postgres-replica
Sending build context to Docker daemon  3.584kB
Step 1/3 : FROM postgres:latest
 ---> 4860bdf1a517
Step 2/3 : ENV PRIMARY_NAME unset
 ---> Running in 0a119ba1f6c8
 ---> 2f389fa04d93
Removing intermediate container 0a119ba1f6c8
Step 3/3 : COPY scripts/ /docker-entrypoint-initdb.d
 ---> 90796d5559b7
Successfully built 90796d5559b7
Successfully tagged postgres-replica:latest

He got muddy water, he one mojo filter

One of the nice things about Docker is that it has built-in networking capabilities. Networking can be an unwieldy and cantankerous Rube Goldberg contraption of the highest order, but Docker tucks all that away into some unholy dark recesses where it can’t hurt us.

All we need to know is that we can create a network bridge. Once we have a network bridge, we can assign it to newly launched containers. This gives all containers assigned to that bridge the same network operational context. Nodes are labeled after the container name, making it trivial to reference them later.

Making the bridge itself is pretty darn easy:

$> docker network create --driver bridge pg_stream

As with everything Docker, we get a hash designating our network. Now let’s put it to use!

He say, “One and one and one is three”

Now comes the fun part. In our last article, we exposed the network interface to the host machine, but this time we’re going to specifically avoid that. Port mapping is ugly and there are better ways to connect to containers, especially when they can interact in their own network sandbox.

Let’s begin by actually launching the containers:

$> docker run --name pg-primary --network pg_stream -d postgres-primary
$> docker run --name pg-replica --network pg_stream -d \
      -e PRIMARY_NAME=pg-primary postgres-replica

Since running a daemonized container also hides the console output, we need some way of verifying those output hashes mean everything worked as expected. As it turns out, we can connect directly to the container of the primary node and outright ask it if there are any streaming replicas connected.

Let’s do that now:

$> docker exec -it pg-primary psql -U postgres -x \
       -c "SELECT * FROM pg_stat_replication"
-[ RECORD 1 ]----+------------------------------
pid              | 76
usesysid         | 16384
usename          | repuser
application_name | walreceiver
client_addr      |
client_hostname  | 
client_port      | 39600
backend_start    | 2017-11-05 23:44:01.613458+00
backend_xmin     | 
state            | streaming
sent_lsn         | 0/3000060
write_lsn        | 0/3000060
flush_lsn        | 0/3000060
replay_lsn       | 0/3000060
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 0
sync_state       | async

Well how about that? One streaming replica. How about another?

$> docker run --name pg-replica2 --network pg_stream -d \
      -e PRIMARY_NAME=pg-primary postgres-replica
$> docker exec -it pg-primary psql -U postgres \
       -c "SELECT pid, client_addr FROM pg_stat_replication"
 pid | client_addr 
  76 |
  97 |

Has it ever been easier to clone a Postgres node?

One thing I can tell you is you got to be free

The cool thing about the replica recipe is that it works for any Postgres container. If we followed some of the steps in the last article, we may have created the Postgres data directory on the native filesystem. Maybe that’s even where it started, and we’re just using a temporary container to investigate its contents.

So long as the primary container is running, we can clone it over and over again. The replicas can also live on the native filesystem, and so long as there’s a recovery.conf file, the replica will run normally. This also means if we promote a replica and restart the container, it will clone upstream again.

If we want to avoid that, we can write a more advanced bootstrap script that also checks for recovery.done, and if found, applies pg_rewind. That will allow us to follow the primary again without all the overhead of rebuilding the replica from scratch.

There are obviously more safety checks we could apply, but as a proof-of-concept, this is actually slightly ridiculous. Someone armed with these two images can create a primary Postgres node and an army of replicas with minimal knowledge about Postgres itself. It’s a set of black boxes perfect for stamping out a very simple cluster.

You can’t really get more RAD than that.


  • Matt Bunter

    I can see the interest in containers/Docker for devs and testers. But is the added complexity interesting for Prod environments? I mean how many databases on how many different ‘machines’ do we really want? And what do we do when we have to modify PostgreSQL params? How often does a test or dev environment get to the max_connections value? Or is this a testing issue? What about version changes? What about extensions and version changes? Am I just being an old fart with these dumb questions?

    • Shaun Thomas

      Don’t feel bad for questioning any methodology. As with all things, there are areas of application that work better than others. In this case, Docker is great for experimentation and testing.

      Altering parameters is easily done through ALTER SYSTEM, and if the parameter requires a restart, PgBouncer can hide that easily enough. Extensions, versions, etc., are perfectly tailored to containers, because their contents are either suspect or wholly disposable. As with this article, make a copy, promote it, experiment as much as is necessary. Upgrade the extensions. Alter tables. Deploy the latest application DDL and run smoke tests. Do whatever you want, and then throw it away. The original database is still there for the next replica.

      More than that however, is that I’ve heard arguments in the other direction as well, in that production isn’t automatically bad practice. Docker containers are ultimately ephemeral, and as such, there are no more lies regarding instance stability. Provided the data is mounted on a volume external to the container, it really increases the amount of flexibility in some ways. Did the container die? Oh well, that’s what WAL is for. Need a different load-out in the container? Add a PgBouncer container so you can pause incoming connections and swap out the containers for little to no cost. Is the database otherwise relatively small, or represent a pure foreign data wrapper implementation for report aggregation? Distribute a container to anyone that wants or needs one, because they’re essentially free.

      There are quite a few production examples where this approach might make sense. I’m not brave enough to use them myself of course, and I generally wouldn’t recommend doing so, but I can see the perspective of those who do. To a certain extent, so long as the data survives, the containers themselves are irrelevant. There’s probably a way to make that work in a well orchestrated stack like Kubernetes or OpenShift. There’s a reason those exist and are evolving so rapidly, after all.

      Honestly, all of that is what inspired this series in the first place. I guess we’ll see. :)

Leave a Reply

Your email address will not be published. Required fields are marked *