2ndQuadrant, Professional PostgreSQL

2ndQuadrant Ltd official blog

Whether or not you made it our CHAR(10) conference last month, you can now relive part of the experience by downloading the conference slides. Some of those were posted live during the conference, some showed up later, but almost everything is there now. Sadly, Nic Ferrier's entertaining presentation about how WooMe was scaled up using Londiste and Django wasn't available in a form we could easily replay. For that one, you certainly did have to be there, in more ways than one.

The two talks I found the most informative were the updates on the states of pgpool-II and pgmemcache. Both those tools have that slightly frustrating combination of being really useful and a bit underdocumented relative to how complicated they are (in English at least!), so getting additional insight into them from those actually working on the code was great.

Markus's discussion of MVCC and clustering also had a fun twist to it. His talk ended with a performance analysis of his Postgres-R against pgpool-II, Postgres-XC, and PostgreSQL 9 using Streaming Replication plus Hot Standby, all used in cluster configurations to accelerate dbt2 test results. I don't quite agree with his premise there that network congestion is the most vital cluster component because "overall computing power, memory and storage capacity scale easily"--that's not always true--but it was satisfying to see that the PG9 HS/SR pairing is efficient in that regard.

The conference set aside two sessions to talk about general clustering topics in a relatively unstructured way. The more heated discussion talked about what would make PostgreSQL deployments into cloud computing infrastructure easier to deal with. That stirred up enough ideas to generate two blog entries from my coworkers already.

One of the ideas from that session I found particularly interesting was noting that if you have a deployment where nodes are added in the "elastic" way people like to discuss in relation to the cloud concept, there's a manageability gap there right now in terms of making it easy for applications to talk to that node set. If you can put pgpool-II or pgBouncer between your application and the set of nodes, you can abstract away exactly what's behind the nodes a bit right now. But now you've added another layer and therefore a potential bottleneck to the whole thing. That's the opposite of what elastic cloud deployments are supposed to be about: just adding capacity as needed with minimal management work.

A solution approach suggested was making it easier to build a database routing directory at the application level, so that apps can just ask for the type of node needed and get one to directly connect to. Nodes can just register themselves to the directory as they are brought online (or are taken down). This has similarities to some components that are already floating around. The directory lookup part you might put into LDAP; PostgreSQL servers can already announce themselves via ZeroConf AKA Bonjour. It's not hard to imagine bolting those two together, putting an application layer that does LDAP lookups connected to a routing backend that tracks available nodes via any number of protocols. As usual, the devil's in the details. Things like timing out failed nodes, distinguishing between read and write traffic (pgpool-II does it by actually parsing the SQL, which is expensive), and making the resulting directory broadcasts cached for high performance while also featuring cache invalidation are all tricky implementation details to get right.

With PostgreSQL 9.0 featuring more ways than ever to scale upward database architecture, this problem isn't going away though. I'm not sure what form yet people are going to solve it in, but it's a common enough problem that it's worth solving.

Officially Greenplum Database Single Node Edition (SNE) is only installable on Red Hat Enterprise Linux (RHEL) and SUSE Linux Enteprise Server (SLES), but while surfing the web I have seen many requests on how to install it on Debian/Ubuntu. Here I’m trying to give you some advices.

One of the main reasons users switch from other relational databases to PostgreSQL is the advanced support for geographic objects included in the PostGIS extension.

Being PostgreSQL specialists at 2ndQuadrant, we have tried to investigate if it was possible (and how) to install PostGIS on the Greenplum Single Node edition. Let's see how Marco Nenciarini, 2ndQuadrant consultant and a long time Debian developer, tried to do it.

Last week at the CHAR(10) conference we had a workshop on "Cloud Databases". To put it simply: what to do when the use case requirements exceed the resources available in the database server.

This was a main topic of the whole conference, and several solutions have been illustrated during the day. A common theme has been that no solution fits all the use cases, and that each solution comes with its cost; hence you have to choose the solution that your use case can afford.

CHAR(10) - Clustering, high availability, and replication conference with top international speakers. On-line booking registration now being taken at http://www.char10.org.

This week I did something I'd prefer to never repeat:  I left the country, did something useful, and made it back again in the same day.  The occasion was the FreeBSD Developer Summit, held just before BSDCan--the convention that happens in Ottawa the week before PGCon every year.  So I get to head right back again next week, but stay a while that time.

The FreeBSD developers were nice enough to sponsor my trip so that we could talk about both the business and technical hurdles that I felt were keeping the sort of companies I work with from deploying their databases on FreeBSD more often than they do.  My slightly updated slides are available on our talks page, I cleaned up a couple of things from what was presented (the most important rewording I'll talk about below).

I was very pleased at how friendly and receptive the developers were even to some of my critical comments.  FreeBSD and PostgreSQL have very like minded communities:  open for any purpose BSD license, academic roots, developers focused on stability, and even a strong documentation culture.  There's been plenty of cross-over too.

Much of the PostgreSQL infrastructure has been run using FreeBSD jails for quite some time (although plans are moving to use more Debian in its place, details on why at Inside the PostgreSQL Project Infrastructure).  My running joke during the talk was that if PostgreSQL developers are eating their own dog food by deploying critical infrastructure that depends on the database, much of that has been served in a FreeBSD bowl.  (The lunch at the conference session was pizza, much better choice)

And there's been plenty of FreeBSD development that's used PostgreSQL benchmarking as a measuring stick for the success of their advances.  The very popular Introducing FreeBSD 7.0 slides that not only showed their achieving performance parity against Linux during that release, it doubled as a document showing how PostgreSQL outscales MySQL.  Cheers all around for community driven, BSD licensed code.

One bit of audience contention during my talk came from my assertion that not having support for Emulex fiber channel cards in FreeBSD was preventing a significant amount of "big iron" adoption for databases, due to their perception as the market leader for connecting up expensive hardware like SANs.  The guys from FreeBSD hardware and support vendor iXsystems called me out on that, suggesting that the alternative vendor here--QLogic--is both completely trusted by the big boys and has top notch FreeBSD driver quality.

I did a bit more research into whether I was suffering from sampling bias from the set of people I'd talked to about this, and it looks like that was the case.  While Emulex claims they've been named Sun's "Best-in-Class Supplier for OEM products", and all the Sun FC cards I've personally run into came from them, there are tons of Sun rebrands of both Emulex and QLogic cards.  Same thing is true at all the other vendors I mentioned in my talk; you can get FC cards from both manufacturers via HP and Dell too.  I think my general point, that not supporting both Emulex and QLogic hurts the perception of FreeBSD as a serious choice for large businesses, still stands; it's just not quite as bad as I'd feared.  Accordingly, I tweaked the wording in the slides I'm publishing, to better match reality here than the ones I presented.

In additional to the solid core they've been growing for years, FreeBSD's license has allowed them to incorporate two very valuable features Sun released as open-source, ZFS and DTrace, into their operating system, both of which are incompatible with Linux's license and are extremely valuable for PostgreSQL deployments.  It's still not ideal yet; FreeBSD DTrace can currently be used only by root for example.  Limitations such as these have in the past kept me from being particularly motivated to work with FreeBSD.  The existence of a free commercial Solaris that ran on generic hardware, combined with the steady progress and open enough community around OpenSolaris, satisfied my needs better.  While not many of my PostgreSQL installations have been on Solaris, its has a monopoly share for hosting the terabyte scale databases I've worked with.  High quality filesystem snapshots via ZFS and the additional piece of mind you get from disk block checksums alone justified those platform decisions.

The problem today is that hating everything about how Oracle does business is what got me working with PostgreSQL in the first place, and now that they own Sun they're doing the same things to Solaris.  No more Solaris on non-Sun hardware, serious cutbacks on the open-source version (OpenSolaris looks like a walking corpse to me), cutting off even basic OS patches unless you have a support contract--that's what we've seen just in the first round from Oracle here.  Solaris isn't free in any sense of the word again, we're right back to the same dynamics that pushed me away from them and toward Linux fifteen years ago.

But I continue to be dissapointed at how little focus there is on quality control in Linux.  How poorly the filesystem mechanics work for the sorts of database work I do doesn't help either.  The Linux OOM killer might as well be named the Linux PostgreSQL Hater for how it acts on my servers.  And those sexy Solaris features I know work so well for databases, still not there (even if SystemTap is getting better at DTrace emulation).

Meanwhile, FreeBSD has the whole "free" thing sorted out right in their name, and their quality control paranoia is similar to that of your typical good DBA.  It looks to me like they're very close to fully assimilating ZFS and DTrace to the point where they can start improving them, rather than just working on getting the original feature set Solaris already had complete and the matching code stable.  I think all of us who work on business critical PostgreSQL deployments and who value free software should do a sanity check on just what dog food we're chewing on, and start making sure there's a FreeBSD bowl there at least sometimes.  From what I heard this week, the FreeBSD developers are gearing for another round of chewing on ours too.  They're looking into database oriented performance improvements as part of future development, and they're not any happier about using MySQL for that than I am about running PostgreSQL on Solaris.  Looks like it might be bowls of dog food all around.  Nobody said that leading the software industry was going to be tasty.

If you have a Linux server of the RedHat family (inclusing CentOS and Fedora), you might envy the way Debian/Ubuntu distributions handle PostgreSQL clusters management.

Although it is not easy to install different PostgreSQL versions on the same RedHat Linux server using RPMs, it is much simpler to install several instances of PostgreSQL (servers) and, at the same time, take advantage of the services infrastructure.

If you're running Linux, and particularly if you're running a database on Linux, it's been hard to recommend any filesystem other than plain old ext3 in recent years.  Some of the alternatives that looked interesting at one point--jfs, ReiserFS--are completely abandoned at this point.  The one that has been almost viable for some time now is XFS, originally an SGI projecs.  And it's back to being in the limelight again this week.

XFS had suffered from a number of problems in the past.  Since it was designed for stable hardware, it wasn't as robust on standard cheap PC hardware at first; quite a bit of that was just cleaned up two years ago.  It had this odd problem with zeroed files that scared some people off.  It was treated as a second-class citizen in business oriented Linux distributions like RedHat, requiring you to compile your own kernel; even on the less restrictive CentOS, you had to do some strange looking setup steps to add XFS support, and the result was quite obviously unsupported.  And as one of the first filesystems to turn on and aggressively utilize write barriers, deployments were vulnerable to drives and controllers that didn't flush their caches when told to, an issue you don't find as often on modern hardware anymore if you configure it right (except for SSDs, but that's another story).

So why bother?  Well, performance is one major reason.  I found myself working with XFS again when working with Greenplum's free Single Node Edition software recently.  Greenplum told me flat out that they didn't recommend anything but XFS for high-performance installs, and given the underlying similarities to community PostgreSQL I felt that was worth investigating why that was some more.

The timing on that turned out to be perfect.  One of the other limitations of ext3 is that on common hardware it will only support 16TB of storage.  Since you can put that much storage in a medium sized disk rack now, that's clearly not enough for high-end systems nowadays, much less a few years from now.  Realizing that, RedHat has been seriously reviving their support for XFS in their distribution of Linux.  RHEL 5.4, released a few months ago, added it back in as an optional module for some customers.  You still couldn't install on XFS, and even the CentOS version didn't support 32-bit installs, but it was clearly making steps toward mainstream again.

Yesterday the first public beta of RHEL6 was released, and XFS is back to being right in the major feature set.  It's sitting next to ext4 on the supported filesystem list, pointing out its suitablity for large installations in particular.  So I can now tell people that they have XFS support available in somewhat rough form in RHEL/CentOS 5.4, with the expectation that it's a first class supported filesystem as systems are upgraded to RHEL6 and its derivates in the future, and have some hope that will be reliable.

With the enteprise Linux support and accordingly the perceived stability side of the XFS code finally under control again, how about the performance?  Well, it turns out Greenplum was right about XFS being worth the trouble to get running.  I took my test server and reformatted one of its moderately fast drives with three different filesystem/mount combinations:  ext3 ordered, ext3 journal, and xfs.  After three bonnie++ 1.96 runs with each filesystem, the results I saw broke down like this:

  • ext3 ordered:  39-58MB/s write, 44-72MB/s read
  • ext3 journal:  25-30MB/s write, 49-67MB/s read
  • xfs:  68-72MB/s write, 72-77MB/s read

While the best of the ext3 read results approached similar levels to what xfs was capable of, on average it did much better.  And the write results were at least 25% better in all cases.  I liked the tighter, more predictable throughput as well; inconsistent performance is something I often struggle with on ext3.

I'm not normally one to be an early adopter of new Linux releases, but the RHEL6 beta with full XFS support has replaced the thorougly underwhelming new Ubuntu release at the top of my list of OSes to install next.  It's not often you see filesystem technology get a second chance to impress, but XFS seems to have made an unexpected transition back to completely relevant again, for now.  I'm not sure how long that will be true, with both ext4 available already and btrfs coming closer to production quality by recently reaching a stable disk format.  It will be interesting to see how this reinvigorated set of filesystem choices on Linux plays out.
A few weeks ago I presented an updated 2010 version of my talk on database hardware benchmarking at PG East; slides available from our talks page.  CPU and memory performance are particularly important for a PostgreSQL database, because every individual query runs as a single process.  Therefore, the speed of your fastest core determines how fast any one query can execute at, and in modern systems that's quite likely to bottleneck based on memory speed.

One of the things that's obvious from recent memory speed results is that all of AMD's processors have been stuck in a distant second place for almost 18 months now.  While AMD continues to use DDR2-800, Intel's "Nehalem" processors, shipping in volume since early 2009, have been adopting increasingly fast DDR3 in good performing multi-channel configurations--the exact area AMD used to be the king of.  In the normal single or dual core server configuration, Intel has had such a lead that it's been impossible to recommend them for anything but a completely disk-bound workload for some time now.

Like many commentaries on PC hardware, my suggestions were only cutting edge for...drumroll please...one week.  Basically, the minute my talk was over, AMD released a new line of 12-core processors that use DDR-1333, and they've closed most of the gap with Intel again.  In raw memory performance, they've increased memory performance 130% over their earlier design, and actually pulled ahead on that low-level benchmark.

How about database workloads?  One of the supporting bits of data I pointed to for how much the CPU/memory performance could impact a database workload were the Oracle Charbench "Calling Circle" OLTP benchmark results run by AnandTech.  Their new Calling Circle results show where the market is at now.  Intel still owns the top part of the market, but AMD's results with their Opteron 6174 are back to respectable. 

If you have a workload where more cores is what you need most of the time, the new processors from AMD could be just what you're looking for.  Fast enough for single queries again, scaling up quite well to handle workloads with many clients.  Memory technology really matters, and you should make sure to note (and benchmark yourself!) the speed of any system you're considering or using to make sure it's appropriate for your workload.

How long will this situation continue?  Well, Intel's next big server processor refresh, codenamed Sandy Bridge, is expected by the end of 2010.  Progress marches on.

I have been thinking for a while now about adding Greenplum support to an open-source application for web analytics that I wrote a few years ago, which is called ht://Miner and uses PostgreSQL.

In order to do this, I need a multi-CPU environment. While still waiting to get our new servers installed here in our data centre in Italy, I decided to look at Amazon's Elastic Compute Cloud (EC2) infrastructure. My intention is to do some benchmarking and spot the main differences in terms of performances between Greenplum Single Node Edition and PostgreSQL 8.4, my favourite DBMS.

If you wish to follow this article, you need to have an Amazon AWS account with a valid credit card. Do not worry, this test will only cost you a couple of dollars!