<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>2ndQuadrant, Professional PostgreSQL - Greg's PlanetPostgreSQL</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/" />
    <link rel="self" type="application/atom+xml" href="http://blog.2ndquadrant.com/en/atom-greg-planetpostgresql.xml" />
    <id>tag:blog.2ndquadrant.com,2010-01-19:/en//3</id>
    <updated>2010-07-25T22:17:15Z</updated>
    <subtitle>2ndQuadrant Ltd official blog</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Open Source 4.12</generator>

<entry>
    <title>Heads in the cloud at CHAR(10)</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/07/heads-in-the-cloud-at-char10.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.97</id>

    <published>2010-07-25T21:14:37Z</published>
    <updated>2010-07-25T22:17:15Z</updated>

    <summary>Whether or not you made it to our CHAR(10) conference last month, you can now relive part of the experience by downloading the conference slides. Some of those were posted live during the conference, some showed up later, but almost everything...</summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="International News" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresqlcloudchar10" label="postgresql cloud char10" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>Whether or not you made it to our CHAR(10) conference last month, you can now relive part of the experience by downloading the <a href="http://projects.2ndquadrant.com/char10">conference slides</a>.  Some of those were posted live during the conference, some showed up later, but almost everything is there now.  Sadly, Nic Ferrier's entertaining presentation about how <a href="http://www.woome.com/">WooMe</a> was scaled up using Londiste and Django wasn't available in a form we could easily replay.  For that one, you certainly did have to be there, in more ways than one.</p>

<p>The two talks I found the most informative were the updates on the states of pgpool-II and pgmemcache.  Both those tools have that slightly frustrating combination of being really useful and a bit underdocumented relative to how complicated they are (in English at least!), so getting additional insight into them from those actually working on the code was great.</p>

<p>Markus's discussion of MVCC and clustering also had a fun twist to it.  His talk ended with a performance analysis of his Postgres-R against pgpool-II, Postgres-XC, and PostgreSQL 9 using Streaming Replication plus Hot Standby, all used in cluster configurations to accelerate dbt2 test results.  I don't quite agree with his premise there that network congestion is the most vital cluster component because "overall computing power, memory and storage capacity scale easily"--that's not always true--but it was satisfying to see that the PG9 HS/SR pairing is efficient in that regard.</p>

<p>The conference set aside two sessions to talk about general clustering topics in a relatively unstructured way.  The more heated of the two was about what would make deploying PostgreSQL into cloud computing infrastructure easier to deal with.  That stirred up enough ideas to generate two <a href="http://blog.2ndquadrant.com/en/2010/07/some-ideas-about-lowlevel-reso.html">blog</a> <a href="http://blog.tapoueh.org/articles/blog/_MVCC_in_the_Cloud.html">entries</a> from my coworkers already.</p>

<p>One of the ideas from that session I found particularly interesting was noting that if you have a deployment where nodes are added in the "elastic" way people like to discuss in relation to the cloud concept, there's a manageability gap right now in terms of making it easy for applications to talk to that node set.  If you can put pgpool-II or pgBouncer between your application and the set of nodes, you can already abstract away, to some degree, exactly which nodes are behind it.  But now you've added another layer and therefore a potential bottleneck to the whole thing.  That's the opposite of what elastic cloud deployments are supposed to be about:  just adding capacity as needed with minimal management work.</p>

<p>A solution approach suggested was making it easier to build a database routing directory at the application level, so that apps can just ask for the type of node needed and get one to directly connect to.  Nodes can just register themselves to the directory as they are brought online (or are taken down).  This has similarities to some components that are already floating around.  The directory lookup part you might put into LDAP; PostgreSQL servers can already announce themselves via ZeroConf AKA Bonjour.  It's not hard to imagine bolting those two together, putting an application layer that does LDAP lookups connected to a routing backend that tracks available nodes via any number of protocols.  As usual, the devil's in the details.  Things like timing out failed nodes, distinguishing between read and write traffic (pgpool-II does it by actually parsing the SQL, which is expensive), and making the resulting directory broadcasts cached for high performance while also featuring cache invalidation are all tricky implementation details to get right.</p>
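
<p>To make the directory idea a bit more concrete, here's a minimal sketch in Python of just the registration, timeout, and read/write lookup parts.  Everything in it is made up for illustration: the <code>NodeDirectory</code> name, the <code>ttl_seconds</code> heartbeat window, and the two role names are assumptions of mine, not an existing tool, and it leaves out the LDAP, ZeroConf, and caching pieces entirely.</p>

<pre>
```python
import time


class NodeDirectory:
    """Hypothetical in-memory routing directory; not an existing PostgreSQL tool."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # assumed heartbeat window
        self.clock = clock              # injectable so timeouts can be tested
        self.nodes = {}                 # connection string -> (role, last_seen)

    def register(self, dsn, role):
        """Called by a node when it comes online, and again as a heartbeat."""
        if role not in ("read", "write"):
            raise ValueError("role must be 'read' or 'write'")
        self.nodes[dsn] = (role, self.clock())

    def unregister(self, dsn):
        """Called by a node that is being taken down cleanly."""
        self.nodes.pop(dsn, None)

    def lookup(self, role):
        """Return live connection strings for a role, dropping timed-out nodes."""
        now = self.clock()
        expired = [dsn for dsn, (_, seen) in self.nodes.items()
                   if now - seen > self.ttl_seconds]
        for dsn in expired:
            del self.nodes[dsn]
        return sorted(dsn for dsn, (r, _) in self.nodes.items() if r == role)
```
</pre>

<p>An application would call <code>lookup("read")</code> or <code>lookup("write")</code> and connect directly to one of the results, which moves the read/write decision to the caller instead of parsing SQL.  A node that stops sending heartbeats simply ages out after the timeout.</p>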

<p>With PostgreSQL 9.0 featuring more ways than ever to scale database architectures upward, this problem isn't going away.  I'm not sure yet what form the solution is going to take, but it's a common enough problem that it's worth solving.</p>]]>
        
    </content>
</entry>

<entry>
    <title>PostgreSQL, FreeBSD, and Free Dog Food</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/05/postgresql-freebsd-and-free-do.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.89</id>

    <published>2010-05-14T05:28:46Z</published>
    <updated>2010-05-14T07:37:24Z</updated>

    <summary><![CDATA[ This week I did something I'd prefer to never repeat:&nbsp; I left the country, did something useful, and made it back again in the same day.&nbsp; The occasion was the FreeBSD Developer Summit, held just before BSDCan--the convention that...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresqlfreebsd" label="postgresql freebsd" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[ This week I did something I'd prefer to never repeat:&nbsp; I left the country, did something useful, and made it back again in the same day.&nbsp; The occasion was the FreeBSD Developer Summit, held just before <a href="http://www.bsdcan.org/">BSDCan</a>--the convention that happens in Ottawa the week before PGCon every year.&nbsp; So I get to head right back again next week, but stay a while that time.<br /><br />The FreeBSD developers were nice enough to sponsor my trip so that we could talk about both the business and technical hurdles that I felt were keeping the sort of companies I work with from deploying their databases on FreeBSD more often than they do.&nbsp; My slightly updated slides are available on our <a href="http://projects.2ndquadrant.it/talks">talks page</a>; I cleaned up a couple of things from what was presented (the most important rewording I'll talk about below).<br /><br />I was very pleased at how friendly and receptive the developers were even to some of my critical comments.&nbsp; FreeBSD and PostgreSQL have very like-minded communities:&nbsp; open for any purpose BSD license, academic roots, developers focused on stability, and even a strong documentation culture.&nbsp; There's been plenty of cross-over too.<br /><br />Much of the PostgreSQL infrastructure has been run using FreeBSD jails for quite some time (although plans are moving to use more Debian in its place; details on why are at <a href="http://postgresqlconference.org/2010/east/talks/inside_the_postgresql_project_infrastructure">Inside the PostgreSQL Project Infrastructure</a>).&nbsp; My running joke during the talk was that if PostgreSQL developers are eating their own dog food by deploying critical infrastructure that depends on the database, much of that has been served in a FreeBSD bowl.&nbsp; (The lunch at the conference session was pizza, a much better choice.)<br /><br />And there's been plenty of FreeBSD development that's used PostgreSQL benchmarking as a
measuring stick for the success of their advances.&nbsp; The very popular <a href="http://people.freebsd.org/%7Ekris/scaling/7.0%20Preview.pdf">Introducing FreeBSD 7.0</a> slides not only showed FreeBSD achieving performance parity with Linux in that release, they doubled as a document showing how PostgreSQL outscales MySQL.&nbsp; Cheers all around for community driven, BSD licensed code.<br /><br />One bit of audience contention during my talk came from my assertion that not having support for Emulex Fibre Channel cards in FreeBSD was preventing a significant amount of "big iron" adoption for databases, due to Emulex's perception as the market leader for connecting up expensive hardware like SANs.&nbsp; The guys from FreeBSD hardware and support vendor <a href="http://www.ixsystems.com/">iXsystems</a> called me out on that, suggesting that the alternative vendor here--QLogic--is both completely trusted by the big boys and has top notch FreeBSD driver quality.<br /><br />I did a bit more research into whether I was suffering from sampling bias from the set of people I'd talked to about this, and it looks like that was the case.&nbsp; While Emulex claims they've been named Sun's "<a href="http://www.emulex.com/partners/oems/sun-microsystems.html">Best-in-Class Supplier for OEM products</a>", and all the Sun FC cards I've personally run into came from them, there are tons of Sun rebrands of both <a href="http://blogs.sun.com/jmcp/entry/current_sun_emulex_fc_hba">Emulex</a> and <a href="http://blogs.sun.com/jmcp/entry/current_sun_qlogic_fc_hba">QLogic</a> cards.&nbsp; The same thing is true at all the other vendors I mentioned in my talk; you can get FC cards from both manufacturers via <a href="http://h18006.www1.hp.com/storage/saninfrastructure/hba.html">HP</a> and <a href="http://www.dell.com/us/en/enterprise/networking/blade-fibre-channel-card/cp.aspx?refid=blade-fibre-channel-card">Dell</a> too.&nbsp; I think my general point, that not supporting both Emulex and
QLogic hurts the perception of FreeBSD as a serious choice for large businesses, still stands; it's just not quite as bad as I'd feared.&nbsp; Accordingly, I tweaked the wording in the slides I'm publishing so they match reality better than the ones I presented.<br /><br />In addition to the solid core they've been growing for years, FreeBSD's license has allowed them to incorporate two very valuable features Sun released as open-source, ZFS and DTrace, into their operating system, both of which are incompatible with Linux's license and are extremely valuable for PostgreSQL deployments.&nbsp; It's still not ideal yet; FreeBSD DTrace can currently be used <a href="http://www.freebsd.org/doc/en/books/handbook/dtrace-implementation.html">only by root</a> for example.&nbsp; Limitations such as these have in the past kept me from being particularly motivated to work with FreeBSD.&nbsp; The existence of a free commercial Solaris that ran on generic hardware, combined with the steady progress and open enough community around OpenSolaris, satisfied my needs better.&nbsp; While not many of my PostgreSQL installations have been on Solaris, it has a monopoly share for hosting the terabyte scale databases I've worked with.&nbsp; High quality filesystem snapshots via ZFS and the additional peace of mind you get from disk block checksums alone justified those platform decisions.<br /><br />The problem today is that hating everything about how Oracle does business is what got me working with PostgreSQL in the first place, and now that they own Sun they're doing the same things to Solaris.&nbsp; No more Solaris on non-Sun hardware, serious cutbacks on the open-source version (OpenSolaris looks like a walking corpse to me), cutting off even basic OS patches unless you have a support contract--that's what we've seen just in the first round from Oracle here.&nbsp; Solaris isn't free in any sense of the word again; we're right back to the same dynamics that pushed me away from
them and toward Linux fifteen years ago.<br /><br />But I continue to be disappointed at how little focus there is on quality control in Linux.&nbsp; How poorly the filesystem mechanics work for the sorts of database work I do doesn't help either.&nbsp; The Linux OOM killer might as well be named the Linux PostgreSQL Hater for how it acts on my servers.&nbsp; And those sexy Solaris features I know work so well for databases are still not there (even if SystemTap is getting better at DTrace emulation).<br /><br />Meanwhile, FreeBSD has the whole "free" thing sorted out right in their name, and their quality control paranoia is similar to that of your typical good DBA.&nbsp; It looks to me like they're very close to fully assimilating ZFS and DTrace to the point where they can start improving them, rather than just working on getting the original feature set Solaris already had complete and the matching code stable.&nbsp; I think all of us who work on business critical PostgreSQL deployments and who value free software should do a sanity check on just what dog food we're chewing on, and start making sure there's a FreeBSD bowl there at least sometimes.&nbsp; From what I heard this week, the FreeBSD developers are gearing up for another round of chewing on ours too.&nbsp; They're looking into database oriented performance improvements as part of future development, and they're not any happier about using MySQL for that than I am about running PostgreSQL on Solaris.&nbsp; Looks like it might be bowls of dog food all around.&nbsp; Nobody said that leading the software industry was going to be tasty.<br />]]>
        
    </content>
</entry>

<entry>
    <title>The Return of XFS on Linux</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/04/the-return-of-xfs-on-linux.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.86</id>

    <published>2010-04-22T15:29:20Z</published>
    <updated>2010-04-22T16:41:25Z</updated>

    <summary><![CDATA[If you're running Linux, and particularly if you're running a database on Linux, it's been hard to recommend any filesystem other than plain old ext3 in recent years.&nbsp; Some of the alternatives that looked interesting at one point--jfs, ReiserFS--are completely...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="xfslinuxgreenplum" label="xfs linux greenplum" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[If you're running Linux, and particularly if you're running a database on Linux, it's been hard to recommend any filesystem other than plain old ext3 in recent years.&nbsp; Some of the alternatives that looked interesting at one point--jfs, ReiserFS--are completely abandoned at this point.&nbsp; The one that has been almost viable for some time now is XFS, originally an SGI project.&nbsp; And it's back in the limelight this week.<br /><br />XFS had suffered from a number of problems in the past.&nbsp; Since it was <a href="http://lkml.org/lkml/2007/3/28/316">designed for stable hardware</a>, it wasn't as robust on standard cheap PC hardware at first; quite a bit of that was just <a href="http://thread.gmane.org/gmane.comp.file-systems.xfs.general/22268">cleaned up two years ago</a>.&nbsp; It had this odd problem with <a href="http://madduck.net/blog/2006.08.11:xfs-zeroes/">zeroed files</a> that scared some people off.&nbsp; It was treated as a second-class citizen in business oriented Linux distributions like RedHat, requiring you to <a href="http://web.archive.org/web/20080403003724/http://phaq.phunsites.net/2008/02/04/enabling-reiserfs-xfs-jfs-on-redhat-enterprise-linux/">compile your own kernel</a>; even on the less restrictive CentOS, you had to do some strange looking <a href="http://blogwords.neologix.net/neils/?p=1">setup steps</a> to add XFS support, and the result was quite obviously unsupported.&nbsp; And as one of the first filesystems to <a href="http://xfs.org/index.php/XFS_FAQ">turn on and aggressively utilize write barriers</a>, deployments were vulnerable to drives and controllers that didn't flush their caches when told to, an issue you don't find as often on modern hardware anymore if you configure it right (except for SSDs, but that's another story).<br /><br />So why bother?&nbsp; Well, performance is one major reason.&nbsp; I found myself working with XFS again when working with Greenplum's free <a
href="http://www.greenplum.com/products/single-node/">Single Node Edition</a> software recently.&nbsp; Greenplum told me flat out that they didn't recommend anything but XFS for high-performance installs, and given the underlying similarities to community PostgreSQL I felt it was worth investigating why.<br /><br />The timing on that turned out to be perfect.&nbsp; One of the other limitations of ext3 is that on common hardware it will only support <a href="http://en.wikipedia.org/wiki/Ext3">16TB of storage</a>.&nbsp; Since you can put that much storage in a medium sized disk rack now, that's clearly not enough for high-end systems nowadays, much less a few years from now.&nbsp; Realizing that, RedHat has been seriously reviving their support for XFS in their distribution of Linux.&nbsp; RHEL 5.4, released a few months ago, added it back in as an optional module for some customers.&nbsp; You still couldn't <a href="http://phaq.phunsites.net/2008/02/04/enabling-reiserfs-xfs-jfs-on-redhat-enterprise-linux/">install on XFS</a>, and even the CentOS version <a href="http://wiki.centos.org/Manuals/ReleaseNotes/CentOS5.4">didn't support 32-bit installs</a>, but it was clearly making steps toward mainstream again.<br /><br />Yesterday the first <a href="http://press.redhat.com/2010/04/21/red-hat-enterprise-linux-6-beta-available-today-for-public-download/">public beta of RHEL6</a> was released, and XFS is back to being right in the major feature set.&nbsp; It's sitting next to ext4 on the <a href="http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6-Beta/html/Beta_Release_Notes/filesystems.html">supported filesystem</a> list, pointing out its suitability for large installations in particular.&nbsp; So I can now tell people that they have XFS support available in somewhat rough form in RHEL/CentOS 5.4, with the expectation that it's a first class supported filesystem as systems are upgraded to RHEL6 and its derivatives in the future, and have some
hope that it will be reliable.<br /><br />With the enterprise Linux support and accordingly the perceived stability side of the XFS code finally under control again, how about the performance?&nbsp; Well, it turns out Greenplum was right about XFS being worth the trouble to get running.&nbsp; I took my test server and reformatted one of its moderately fast drives with three different filesystem/mount combinations:&nbsp; ext3 ordered, ext3 journal, and xfs.&nbsp; After three bonnie++ 1.96 runs with each filesystem, the results I saw broke down like this:<br /><br /><ul><li>ext3 ordered:&nbsp; 39-58MB/s write, 44-72MB/s read</li><li>ext3 journal:&nbsp; 25-30MB/s write, 49-67MB/s read</li><li>xfs:&nbsp; 68-72MB/s write, 72-77MB/s read</li></ul><br />While the best of the ext3 read results approached similar levels to what xfs was capable of, on average xfs did much better.&nbsp; And the write results were at least 25% better in all cases.&nbsp; I liked the tighter, more predictable throughput as well; inconsistent performance is something I often struggle with on ext3.<br /><br />I'm not normally one to be an early adopter of new Linux releases, but the RHEL6 beta with full XFS support has replaced the thoroughly underwhelming new Ubuntu release at the top of my list of OSes to install next.&nbsp; It's not often you see filesystem technology get a second chance to impress, but XFS seems to have made an unexpected transition back to completely relevant again, for now.&nbsp; I'm not sure how long that will be true, with both ext4 available already and btrfs coming closer to production quality by recently reaching a <a href="https://btrfs.wiki.kernel.org/index.php/Main_Page">stable disk format</a>.&nbsp; It will be interesting to see how this reinvigorated set of filesystem choices on Linux plays out. <br />]]>
        
    </content>
</entry>

<entry>
    <title>AMD, Intel, and PostgreSQL</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/04/amd-intel-and-postgresql.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.85</id>

    <published>2010-04-14T19:00:27Z</published>
    <updated>2010-04-14T19:31:29Z</updated>

    <summary><![CDATA[A few weeks ago I presented an updated 2010 version of my talk on database hardware benchmarking at PG East; slides available from our talks page.&nbsp; CPU and memory performance are particularly important for a PostgreSQL database, because every individual...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[A few weeks ago I presented an updated 2010 version of my talk on database hardware benchmarking at PG East; slides available from our <a href="http://projects.2ndquadrant.com/talks">talks page</a>.&nbsp; CPU and memory performance are particularly important for a PostgreSQL database, because every individual query runs as a single process.&nbsp; Therefore, the speed of your fastest core determines how fast any one query can execute, and in modern systems that's quite likely to bottleneck based on memory speed.<br /><br />One of the things that's obvious from recent memory speed results is that all of AMD's processors have been stuck in a distant second place for almost 18 months now.&nbsp; While AMD continues to use DDR2-800, Intel's "Nehalem" processors, shipping in volume since early 2009, have been adopting increasingly fast DDR3 in good performing multi-channel configurations--the exact area AMD used to be the king of.&nbsp; In the normal single or dual core server configuration, Intel has had such a lead that it's been impossible to recommend AMD for anything but a completely disk-bound workload for some time now.<br /><br />Like many commentaries on PC hardware, my suggestions were only cutting edge for...drumroll please...one week.&nbsp; Basically, the minute my talk was over, AMD released a new line of 12-core processors that use DDR3-1333, and they've closed most of the gap with Intel again.&nbsp; They've increased raw memory performance <a href="http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/5">130%</a> over their earlier design, and actually pulled ahead on that low-level benchmark.<br /><br />How about database workloads?&nbsp; One of the supporting bits of data I pointed to for how much the CPU/memory performance could impact a database workload were the Oracle Charbench "Calling Circle" OLTP benchmark results run by AnandTech.&nbsp; Their <a
href="http://it.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/8">new Calling Circle results</a> show where the market is now.&nbsp; Intel still owns the top part of the market, but AMD's results with their Opteron 6174 are back to respectable.&nbsp; <br /><br />If you have a workload where more cores is what you need most of the time, the new processors from AMD could be just what you're looking for.&nbsp; Fast enough for single queries again, scaling up quite well to handle workloads with many clients.&nbsp; Memory technology really matters, and you should make sure to note (and benchmark yourself!) the speed of any system you're considering or using to make sure it's appropriate for your workload.<br /><br />How long will this situation continue?&nbsp; Well, Intel's next big server processor refresh, codenamed <a href="http://en.wikipedia.org/wiki/Intel_Sandy_Bridge_%28microarchitecture%29">Sandy Bridge</a>, is expected by the end of 2010.&nbsp; Progress marches on.<br />]]>
        
    </content>
</entry>

<entry>
    <title>PGEast, Hardware Benchmarking, and the PG Performance Farm</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/03/pgeast-hardware-benchmarking-a.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.80</id>

    <published>2010-03-11T18:47:38Z</published>
    <updated>2010-03-11T19:54:54Z</updated>

    <summary><![CDATA[Today is the deadline for the special room rate at the hotel hosting this month's PostgreSQL Conference East 2010.&nbsp; If you've been procrastinating booking a spot at the conference, as of tomorrow that will start costing you. My talk is on...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="United States News" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="pgbench" label="pgbench" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[Today is the deadline for the <a href="http://www.postgresqlconference.org/east/2010/accommodations">special room rate</a> at the hotel hosting this month's PostgreSQL Conference East 2010.&nbsp; If you've been procrastinating booking a spot at the conference, as of tomorrow that will start costing you.<br /><br />My talk is on <a href="http://postgresqlconference.org/2010/east/talks/database_hardware_benchmarking">Database Hardware Benchmarking</a> and is scheduled for late afternoon on the first day, Thursday March 25th.&nbsp; Those who might have seen this talk before, either live at <a href="http://www.pgcon.org/2009/schedule/events/152.en.html">PGCon 2009</a> or via the video link available there, might be wondering if I'm going to drag out the same slides and talk again.&nbsp; Not the case; while the general philosophy of the talk ("trust no one, run your own benchmarks") stays the same, the examples and test mix suggested have been updated to reflect another year worth of hardware advances, PostgreSQL work, and my own research during that time.&nbsp; The Intel vs. 
AMD situation in particular has changed quite a bit, requiring a new set of memory benchmarks to really follow what's going on now.<br /><br />And PostgreSQL 9.0 fixed a major problem that kept pgbench from normally delivering accurate results on Linux, due to a <a href="http://kerneltrap.org/mailarchive/linux-kernel/2008/5/21/1899434">kernel regression</a> that made an already far too common situation much worse:&nbsp; it's easy for a single pgbench client to become the bottleneck when running it, rather than the database itself.&nbsp; The review I did for <a href="http://archives.postgresql.org/message-id/alpine.GSO.2.01.0907291918380.19638@westnet.com">multi-threaded pgbench</a> (which can also be multi-process pgbench on systems that don't support threads) suggested a solid &gt;30% speedup even on systems that didn't have the bad kernel incompatibility on them.&nbsp; Subsequent testing suggests it can easily take 8 pgbench processes to get full throughput out of even inexpensive modern processors under recent Linux kernels.&nbsp; I'll go over exactly how that ends up playing out on such systems, and how this new feature makes it possible again to use pgbench as the primary way to measure CPU performance running the database.<br /><br />Recently I've also made an update to the <a href="http://github.com/gregs1104/pgbench-tools">git repo for pgbench-tools</a> that adds working support for PostgreSQL 8.4 and basic 9.0 compatibility, and the next update will include support for the multi-threaded option now that I've mapped out how that needs to work.&nbsp; This is all leading somewhere.&nbsp; Once we have accurate measurements for PostgreSQL performance that are CPU limited on the server side, something that hasn't often been the case for over two years now, those again become a useful way to monitor for performance regressions in the PostgreSQL codebase.&nbsp; The tests included will eventually need to expand to cover more, but for now we've reached a
point where pgbench can be used to find regressions that impact how fast simple SELECT statements execute.&nbsp; I know that works as expected, because every time I accidentally build PostgreSQL with assertions on, that's caught because I see the average processing rate drop dramatically.<br /><br />Once I've got a couple of systems set up here to test for such regressions, the question becomes how to automate what I'm doing, and then to do the same thing against a wider range of build checkouts.&nbsp; Ideally, you'd be able to see a graph of average SELECT performance each day, broken down by version, so that when a commit that reduced it was introduced it would immediately be obvious when the performance dropped.&nbsp; This is the dream goal for building a performance farm similar to the <a href="http://buildfarm.postgresql.org/">PostgreSQL buildfarm</a>.&nbsp; The pieces are almost all together now:&nbsp; my pgbench parts are wrapping up, extensions to the buildfarm to make it speak directly to git are moving along (not a requirement, but nobody working on this project wants to use CVS if we can avoid it), and the main thing missing at this point is someone to put the time in to integrate what I've been doing into a buildfarm-like client.<br /><br />And it looks like we now have a corporate sponsor willing to help with that chunk of work, who I'll let take credit for it when we're all done, and that's scheduled to happen this summer.&nbsp; I fully expect that PostgreSQL 9.1 development, and 9.0 backpatching, is going to happen with an early performance farm in place to guard against performance regressions.&nbsp; If we can backport the new multi-threaded pgbench to older PostgreSQL versions we might include them in the mix as well.&nbsp; I already have a backport of the 8.3 pgbench, which has a lot of improvements, that I maintain just for testing 8.2 systems.&nbsp; With pgbench as a fairly standalone contrib module, it's possible to build a later one
different from the rest of the system, so long as it doesn't expect newer database features to exist too.<br /><br />If that's something you're interested in, my talk at the conference is going to map out the foundations I expect it to be built on.&nbsp; Regardless, I hope you can make it to the conference and enjoy the long list of talks being presented there.<br />]]>
        
    </content>
</entry>

<entry>
    <title>Trade-offs in Hot Standby Deployments</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/02/tradeoffs-in-hot-standby-deplo.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.79</id>

    <published>2010-02-24T07:25:39Z</published>
    <updated>2010-02-24T18:46:44Z</updated>

    <summary>The new Hot Standby feature in the upcoming PostgreSQL 9.0 allows running queries against standby nodes that previously did nothing but execute a recovery process. Two common expectations I&apos;ve heard from users anticipating this feature are that it will allow...</summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="hotstandbyusergroup" label="Hot Standby User Group" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>The new Hot Standby feature in the upcoming PostgreSQL 9.0 allows running queries against standby nodes that previously did nothing but execute a recovery process.  Two common expectations I've heard from users anticipating this feature are that it will allow either distributing short queries across both nodes or running long reports against the standby without using resources on the master.  Both are possible right now, but unless you understand the trade-offs involved in how Hot Standby works, you can run into some unanticipated behavior.</p>

<h2>Standard Long-running Queries</h2>

<p>One of the traditional problems in a database using MVCC, like PostgreSQL, is that a long-running query has to keep open a resource--referred to as a <a href="http://www.postgresql.org/docs/current/static/transaction-iso.html">snapshot</a> in the current Postgres implementation--to prevent the database from removing data the query needs to operate.  For example, even after another client has deleted a row and committed, you can't actually wipe out the physical disk blocks related to that row if an already running query still needs it to complete.  You have to wait until no open queries that expect that row to be visible are still around.</p>

<h2>Hot Standby Limitations</h2>

<p>If you have a long-running query you want Hot Standby to execute, there are a couple of types of bad things that can happen when the recovery process is applying updates.  These are described in detail in the <a href="http://developer.postgresql.org/pgdocs/postgres/hot-standby.html">Hot Standby Documentation</a>.  Some of these bad things will cause queries running on the standby to be canceled for reasons that might not be intuitively obvious:</p>

<ul>
<li>A HOT update or VACUUM-related update arrives to delete something the query expects to be visible</li>
<li>A B-tree deletion appears</li>
<li>There is a locking conflict between the query you're running and the locks required for the update to be processed</li>
</ul>

<p>The lock situation is difficult to deal with, but not very likely to happen in practice for all that long if you're just running read-only queries on the standby, because those will be isolated via MVCC.  The other two are not hard to run into.  The basic thing to understand is that <em>any</em> UPDATE or DELETE on the master can lead to interrupting any query on the standby; it doesn't matter whether the changes even relate to what the query is doing.</p>

<h2>Good, fast, cheap:  pick two</h2>

<p>Essentially, there are three things people might want to prioritize:</p>

<ol>
<li>Avoid master limiting:  Allow xids and associated snapshots to advance unbounded on the master, so that VACUUM and similar cleanup isn't held back by what the standby is doing</li>
<li>Unlimited queries:  Run queries on the standby for any arbitrary period of time</li>
<li>Current recovery:  Keep the recovery process on the standby up to date with what's happening on the master, allowing fast fail-over for HA</li>
</ol>

<p>In any situation with Hot Standby, it's literally impossible to have all three at once.  You can only pick your trade-off.  The tunable parameters available already let you optimize a couple of ways:</p>

<ul>
<li>Disabling all these delay/defer settings optimizes for always current recovery, but then you'll discover queries are more likely to be canceled than you might expect.</li>
<li><em>max_standby_delay</em> optimizes for longer queries, at the expense of keeping recovery current.  This delays applying updates to the standby once one that will cause a problem (HOT, VACUUM, B-tree delete, etc.) appears.</li>
<li><em>vacuum_defer_cleanup_age</em> and some snapshot hacks can introduce some master limiting to improve on the other two issues, but with a weak UI to do that.  vacuum_defer_cleanup_age is in units of transaction IDs.  You need to have some idea of the average amount of xid churn on your system per unit of time to turn the way people think about this problem ("defer by at least 1 hour so my reports will run") into a setting for this value.  xid consumption rate just isn't a common or even reasonable thing to measure/predict.  Alternately, you can open a snapshot on the primary before starting a long-running query on the standby.  dblink is suggested in the Hot Standby documentation as a way to accomplish that.  Theoretically a daemon serving the standby could be written in user-land, living on the primary, to work around this problem too (Simon has a basic design for one).  Basically, you start a series of processes that each acquire a snapshot and then sleep for a period before releasing it.  By spacing out how long each of them sleeps, you could ensure xid snapshots never advance too quickly on the master.  It should already sound obvious how much of a terrible hack this would be.</li>
</ul>
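
<p>As a rough sketch of the xid-to-time conversion described for vacuum_defer_cleanup_age, you could sample the transaction ID counter on the master twice and treat the difference as an estimate of xid churn.  The one hour spacing and the numbers below are assumptions made up for illustration, and note that calling txid_current() assigns a transaction ID itself, which is harmless at this sampling rate:</p>

<pre><code>
-- Run on the master, then run it again one hour later.
SELECT now() AS sampled_at, txid_current() AS xid;

-- Hypothetical results: 1200000 at the first sample, 1650000 at the second.
-- The difference estimates how many xids one hour of activity consumes.
SELECT 1650000 - 1200000 AS xids_per_hour;
</code></pre>

<p>Under those assumed numbers, a vacuum_defer_cleanup_age of about 450000 would approximate "defer by at least 1 hour", and only for as long as the workload stays similar to the hour that was sampled.</p>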

<h2>Potential Improvements</h2>

<p>The only one of these you can really do something about cleanly is tightening up and improving the UI for the master limiting.  That turns this into the traditional problem already present in the database:  a long-running query holds open a snapshot (or at least limits the advance of visibility related transaction IDs) on the master, preventing the master from removing things needed for that query to complete.  You might alternately think of this as an auto-tuning vacuum_defer_cleanup_age.</p>

<p>The question is how to make the <em>primary</em> respect the needs of long-running queries on the <em>standby</em>.  This might be possible if more information about the transaction visibility requirements of the standby were shared with the master.  That sort of exchange would really be more appropriate for the new Streaming Replication implementation to handle.  The way a simple Hot Standby server is provisioned does not provide any feedback toward the master suitable for this data to be exchanged, besides approaches like the already mentioned dblink hack.</p>

<p>With PostgreSQL 9.0 just reaching a fourth alpha release, there may still be time to see some improvements in this area before the 9.0 release.  It would be nice to see Hot Standby and Streaming Replication really integrated together in a way that accomplishes things that neither is fully capable of doing on its own before coding on this release completely freezes.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Measuring PostgreSQL Checkpoint Statistics</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2010/01/measuring-postgresql-checkpoin.html" />
    <id>tag:blog.2ndquadrant.com,2010:/en//3.75</id>

    <published>2010-01-29T07:25:26Z</published>
    <updated>2010-01-29T18:29:41Z</updated>

    <summary>Checkpoints can be a major drag on write-heavy PostgreSQL installations. The first step toward identifying issues in this area is to monitor how often they happen, which recently became easier thanks to a new interface added to the database....</summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresqlperformance" label="postgresql performance" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>Checkpoints can be a major drag on write-heavy PostgreSQL installations.  The first step toward identifying issues in this area is to monitor how often they happen, which recently became easier thanks to a new interface added to the database.</p>]]>
        <![CDATA[<p>Checkpoints are periodic maintenance operations the database performs to make sure that everything it's been caching in memory has been synchronized with the disk.  The idea is that once you've finished one, you can eliminate needing to worry about older entries placed into the write-ahead log of the database.  That means less time to recover after a crash.</p>

<p>The problem with checkpoints is that they can be very intensive, because completing one requires writing every single bit of changed data in the database's buffer cache out to disk.  There were a number of features added to PostgreSQL 8.3 that allow you to better monitor the checkpoint overhead, and to lower it by spreading the activity over a longer period of time.  I wrote a long article about those changes called <a href="http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm">Checkpoints and the Background Writer</a> that goes over what changed, but it's pretty dry reading.</p>

<p>What you probably want to know is how to monitor checkpoints on your production system, and how to tell if they're happening too often.  Even though things have improved, "checkpoint spikes" where disk I/O becomes really heavy are still possible even in current PostgreSQL versions.  And it doesn't help that the default configuration is tuned for very low disk space and fast crash recovery rather than performance.  The checkpoint_segments parameter that's one input on how often a checkpoint happens defaults to 3, which forces a checkpoint after only 48MB of writes.</p>
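
<p>That 48MB figure is the segment count multiplied by the size of each write-ahead log segment.  Here's a quick sketch of the same arithmetic against a live server, assuming the 16MB segment size that the server is built with by default:</p>

<pre><code>
-- Approximate write volume that forces a checkpoint,
-- assuming 16MB WAL segments (the default build setting).
SELECT setting::integer * 16 AS mb_before_forced_checkpoint
FROM pg_settings
WHERE name = 'checkpoint_segments';
</code></pre>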

<p>You can find out checkpoint frequency two ways.  You can turn on log_checkpoints and watch what happens in the logs.  You can also use the pg_stat_bgwriter view, which gives a count of each of the two sources for checkpoints (time passing and writes occurring) as well as statistics about how much work they did.</p>
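
<p>A minimal sketch of both approaches follows.  The logging option goes into postgresql.conf, and the query reads the raw counters the view keeps for each checkpoint source:</p>

<pre><code>
-- In postgresql.conf, followed by a configuration reload:
--   log_checkpoints = on

-- Checkpoints by source, plus the number of buffers they wrote.
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;
</code></pre>

<p>A checkpoints_req count that is large relative to checkpoints_timed suggests checkpoints are mostly being triggered by write volume rather than by checkpoint_timeout expiring.</p>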

<p>The main problem with making that easier to do is that until recently, it's been impossible to reset the counters inside of pg_stat_bgwriter.  That means you have to take a snapshot with a timestamp on it, wait a while, take another snapshot, then subtract all the values to derive any useful statistics from the data.  That's a pain.</p>
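
<p>For reference, that snapshot-and-subtract routine can be sketched like this.  The table name bgwriter_snapshot is made up for illustration, and the approach assumes the counters are not reset between samples:</p>

<pre><code>
-- First snapshot: save the counters along with when they were read.
CREATE TABLE bgwriter_snapshot AS
  SELECT now() AS sampled_at, checkpoints_timed, checkpoints_req
  FROM pg_stat_bgwriter;

-- Some time later: add a second snapshot.
INSERT INTO bgwriter_snapshot
  SELECT now(), checkpoints_timed, checkpoints_req
  FROM pg_stat_bgwriter;

-- Subtract: the counters only increase, so max minus min gives
-- the activity between the first and last snapshots.
SELECT
  max(sampled_at) - min(sampled_at) AS elapsed,
  max(checkpoints_timed + checkpoints_req)
    - min(checkpoints_timed + checkpoints_req) AS checkpoints
FROM bgwriter_snapshot;
</code></pre>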

<p>Enough of a pain that I  <a href="http://archives.postgresql.org/message-id/4B4F8A96.5080004@2ndquadrant.com">wrote a patch</a> to make it easier.  With the current development version of the database, you can now call pg_stat_reset_shared('bgwriter') and pop all these values back to 0 again.  This allows following a practice that used to be common on PostgreSQL.  Before 8.3, there was a parameter named stats_reset_on_server_start you could turn on.  That reset all of the server's internal statistics each time you started it.  That meant that you could call the handy pg_postmaster_start_time() function, compare with the current time, and always have an accurate count in terms of operations/second of any statistic available on the system.</p>

<p>It's still not automatic, but now that resetting these shared pieces is possible you can do it yourself.  The first key is to integrate statistics clearing into your server startup sequence.  A script like this will work:</p>

<pre><code>
pg_ctl start -l $PGLOG -w
psql -c "select pg_stat_reset();"
psql -c "select pg_stat_reset_shared('bgwriter');"
</code></pre>

<p>Note the "-w" on the start command there--that will make pg_ctl wait until the server is finished starting before it returns, which is vital if you want to immediately execute a statement against it.</p>

<p>If you've done that, and your server start time is essentially the same as when the background writer stats started collection, you can now use this fun query:</p>

<pre><code>
SELECT 
  total_checkpoints,
  seconds_since_start / total_checkpoints / 60 AS minutes_between_checkpoints
FROM 
  (SELECT 
      EXTRACT(EPOCH FROM (now() - pg_postmaster_start_time())) AS seconds_since_start,
      (checkpoints_timed+checkpoints_req) AS total_checkpoints 
    FROM pg_stat_bgwriter
  ) AS sub;
</code></pre>

<p>And get a simple report of exactly how often checkpoints are happening on your system.  The output looks like this:</p>

<pre><code>
total_checkpoints           | 9
minutes_between_checkpoints | 3.82999310740741
</code></pre>

<p>What you do with this information is stare at the average time interval and see if it seems too short.  Normally, you'd want a checkpoint to happen no more often than every five minutes, and on a busy system you might need to push it to ten minutes or more to have a hope of keeping up.  With this example, every 3.8 minutes is probably too frequent--this is a system that needs checkpoint_segments to be higher.</p>

<p>Using this technique to measure the checkpoint interval lets you know if you need to increase the checkpoint_segments and checkpoint_timeout parameters in order to achieve that goal.  You can compute the numbers manually right now, and once 9.0 ships it's something you can consider making completely automatic--so long as you don't mind your stats going away each time the server restarts.</p>

<p>There are some other interesting ways to analyze the data the background writer provides for you in pg_stat_bgwriter, but I'm not going to give away all of my tricks today.</p>]]>
    </content>
</entry>

</feed>
