2ndQuadrant, Professional PostgreSQL

2ndQuadrant Ltd official blog

External web tables are one of the most useful features when you you have to load data into a Greenplum database from different sources.

This week I've been digging deep into PostgreSQL storage hardware again.  Since I'm giving a conference talk on database storage in Austin and in the DC area next week, it seems like a good time for me to actually know the material.  One of the most common questions here is "what's the cheapest SSD I can put my database on?", with the implied hope "...without losing it all the time". Last year the first inexpensive answer to that appeared on the market, and I suggested people take a look at Intel's 320 series drives.  With 217 days of runtime on my first 320 drive here, and Intel's 3rd generation storage line filled out with the more enterprise oriented 710 Series now, it's worth reviewing how that turned out.

It wasn't long after the 320 series drives were introduced that people started reporting a firmware problem with the drive, where it did things like report a capacity of 8MB after a restart along with "BAD_CTX 0000013x" errors.  A firmware update to fix that was released.  There's still some claims of continued problems floating around.  You have to expect some percentage of any product are going to be bad, and the later production of this drive (after the big bug was fixed) don't seem above the usual risk level in hard drives to me.  With the warranty here extended to 5 years (unless you're using it at 'enterprise usage levels'), I think that Intel would be getting killed if the reliability on these was as bad as some people claim.

The reason behind the usage level caveat is the main thing worth talking about here.  The long version of the warranty suggests "The media wear-out indicator reports a normalized value of 100 (when the SSD is brand new out of the factory) and declines to a minimum value of 1. When the value reads 1, this indicates that the SSD is reaching the wear-out limit".  Here's what my first 320 looks like so far:

[root@toy ~]# smartctl -a /dev/sdc
=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSA2CW120G3
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5225
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       34
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       201389
226 Load-in_Time            0x0032   100   100   000    Old_age   Always       -       2687040
227 Torq-amp_Count          0x0032   100   100   000    Old_age   Always       -       0
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       314526
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       201389
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       133021
Not all of these attributes are labeled correctly, and some are "Unknown"; all the gory details are in the product specifications. You can see the Media Wearout above; here's the raw values for other interesting ones, formatted so they're more blog-friendly:

ID# ATTRIBUTE_NAME          RAW_VALUE
  9 Power_On_Hours          5225
 12 Power_Cycle_Count       58
192 Power-Off_Retract_Count 34
225 Load_Cycle_Count        201389
241 Total_LBAs_Written      201389
242 Total_LBAs_Read         133021
192 (hex C0) is "Power-Off Retract Count".  That's how many unsafe shutdowns the drive has been through, which are the situations where the battery backed cache in the drive has been triggered.  With 34 of them here, you can see I've tried to get this drive to die that way.

The first interesting wear figure is 225 (hex code E1) which Intel's documentation describes as "Host Writes".  The units for that are 32MB.  If you look carefully, you'll see that's the same value given in 241 "Total LBAs Written".  That suggests the LBA unit for the drive is also 32MB, which I double-checked last year.  At 32MB each, my write value of 201389 means I've written 6.15TB to this drive.

Now, computing the true lifetime of an SSD depends on a couple of magic values, like the "write amplification" of your workload.  That suggests how often your workload forces small bits of data out to flash, using up some of the NAND cell lifetime faster than it might otherwise last.  These numbers are really hard to estimate.  The most realistic way is figure this out is to run a workload simulation after resetting the drive's internal counters, then see just how much you burned through.  The process is walked through with an example at "Measuring How Long Your Intel SSD Will Last", and it's not too hard to translate that example (which uses Intel's SSD Toolbox software) into a set of of smartctl commands if you're on Linux--the article even uses smartctl for the counter reset part.

The official documentation is this is Intel's Enterprise Server addendum, and here we finally find some hard numbers about the expected life of these drives.  My 120GB drive is said to have a "write endurance" of 15TB.  A pessimistic look at my sample drive here would check total writes and say that, having written over 6TB, I've gone through 40% of the drive lifetime.  But write endurance doesn't work that way; the firmware is constantly doing tricks to extend the life of the drive.  Intel's official number they sometimes tie the warranty to, the Media Wearout, is showing 99% left!  If that's true--I've only used 1% of the drive's lifespan--then I might manage 600TB of writes before this one really dies on me.

So what's the story with the true Enterprise lifetime 710 Series drives?  Those almost the same drives as the 320 series ones, with three significant changes.  First, they're said to use higher quality flash, probably with the same sort of "put the best tested chips first in the expensive models" approach Intel is said to use on their CPU production--what's sometimes called binning.  Second, the drives are overprovisioned with a lot more unused flash compared to the 320 series models, and unused flash really helps extend longevity.  Finally, they don't claim the capabilities to be quite as good.  Random write IOPS numbers on the 710 series drives are lower; my 120GB 320 series drive is specified at 14K write IOPS, while the 100GB 710 series only aims for 2700. The drive doesn't claim to support lots of tiny writes and still last for years, which means it's aimed at a different set of write amplification expectations.  Similarly, the 710 series drives don't refresh the stored cells in the same way.  The downside there is that 710 models are only specified to retain their data for 3 months.  That's probably fine for data center use, but that wouldn't be very acceptable to the more consumer oriented market the 320 series is sold to.

The end result of that, and how the 710 compares to the 320 series drives, is nicely summarized in the "Write Endurance" table in the Tom's Hardware Review.  Instead of the 15TB endurance number my 320 drive specifies, the similar 100GB 710 series model aims for 500TB.  That's just over 30X as long.  In the real world, there may not be that big of a difference, as shown by the projected 600TB figure I'm seeing out of my 320 drive so far.  But Intel's aiming at conservative engineering lifetimes on the specification sheets, and by that measure the storage cells 710 will last longer; the 320 models only may last longer.  And an expected lifetime 30X as long is something some people are surely willing to pay the 710's price premium for.

There are a few parts of the PostgreSQL internals that poke out usefully if you look in the right place for them.  One useful set to know about are the System Columns, which you can explicitly request but don't see by default.  For example:

psql -x -c "SELECT oid,* FROM pg_class LIMIT 1"
There is no column named oid in the pg_class table, but it's there if you ask for it.  The oid used to be relied on more heavily in PostgreSQL as a way to identify rows.  That's not true for regular tables anymore, and you really don't want to start doing that for your own tables.  OIDs are mainly useful now when joining parts of the System Catalog together.  A good example is the Disk Usage query.  If you want to find the namespace a table is in, you need to know you can ask for its OID.  It's possible to get some of this data out of more portable views like information_schema.tables.  But many of the useful things in this area are PostgreSQL specific.  Sometimes I see people starting with the information_schema views and joining against other tables using its text name fields, such as the listed table_name.  That approach has several edge cases that don't work out correctly; not handling TOAST columns is a common example.  That makes them more prone to breaking on you later, probably after your system has gone into production, than an OID based join.

There is also a tableoid system column.  As described in the documentation, its main use case is identifying which partition a row come from.  That's not a great thing to be driving application logic from, but it can be useful for monitoring or troubleshooting purposes.  For example, if you SELECT rows from the parent table in a partitioning inheritance scheme, it's normally expected that no rows will actually be stored there.  Checking the tableoid is one way to confirm that.  You might confirm that your INSERT/UPDATE trigger is moving rows to the right place using tableoid as well.  It's possible to do that for each individual partition section, but running a query against the parent will make sure you hit every row in the table.

Another internal column related to uniquely identifying rows is the ctid.  The ctid is a direct pointer to the physical block (using PostgreSQL's 8K page size) and position of a row.  ctids are a pair of numbers, and the first row will be (0,1).  While this is the fastest way to find a row more than once in the same transaction block, these numbers are not stable in the long term.  Any UPDATE and some maintenance operations will change them.  One thing you can use these for is finding duplicate data in a table.  Let's say you're trying to add a unique constraint, but one row in the table is duplicated 3 times, which blocks the unique index from being created.  When rows are identical in every column, you can't write any simple SELECT statement to uniquely identify them.  That means deleting all of them but one copy requires some annoying and fragile SQL code, combining DELETE with LIMIT and/or OFFSET--which is always scary.  If you use the ctid instead, the implementation will be PostgreSQL specific, but it will also be faster and cleaner.  See Deleting Duplicate Records in a Table for an example of how that can be done.

The other system columns all relate to transaction visibility:  xmin, cmin, xmax, cmax.  When you delete a row in PostgreSQL, it isn't eliminated from disk immediately.  It's possible that some other query that's executing at the same time will still need to see that row, and the transaction isolation in PostgreSQL worries about such things.  If you ever want to learn how that isolation works, the way the Multiversion Concurrency Control (MVCC) implementation is handled, you can watch parts of it happen.  Just open transactions in two different sessions, UPDATE/DELETE in one of them, and then look at those rows in the other.  You can still see them in the session where they weren't touched, but they'll be marked to expire in the future via their xmax being set.  To really pull that all together, you also need to know about some of the System Information Functionstxid_current() is the most useful for this sort of learning experience, it provides a reference point for the always increasing system transaction ID.  You can find a more detailed exploration of using these functions and system columns in Bruce's MVCC Unmasked talk.  The "Routine Maintenance" chapter of my book also shows examples how how MVCC works through the perspective of the system columns.

JDBC is the driver used to access a database with Java. Greenplum has a full working JDBC implementation. In this short article we'll see how to use it.

With an announce on the forum, Greenplum staff has spoke out about the new version of their Database Management System. I can't resist to blog about some of its new features.

In the previous article we have seen how to install Greenplum on multiple nodes. After installation steps, we must init the entire system. Let's see how.

One of the main advantages using Greenplum is that it gains power when it uses multiple nodes. Horizontal scalability is a main feature of Greenplum.

Here is a compact handbook to install a multi-node Data Warehouse environment with Greenplum.

The fifth edition of the Italian PGDay went well beyond our initial expectations. We had about 75 participants, a total of 95 people including staff and speakers.
As I said during the event, rather than PGDay Italy, this should be named PGDay for Italian speakers given the presence of staff from Switzerland (Canton Ticino). Participants came from 12 regions: all regions but Val d'Aosta in the north/centre area, but also from Southern Italy (Naples and Calabria).

Greenplum does not officially support Ubuntu Server 11.10 as underlying operating system. However, I needed to install it on the most recent Ubuntu server just to perform some tests and evaluate it.

Through this article, we are going to complete the MapReduce job started in the previous article.