Friday, November 24

More robust collations with ICU support in PostgreSQL 10

In this article, I want to introduce the ICU support in PostgreSQL, which I have worked on for PostgreSQL version 10, to appear later this year.

Sorting

Sorting is an important functionality of a database system. First, users generally want to see data sorted. Any query result that contains more than one row and is destined for end-user consumption will probably want to be sorted, just for a better user experience. Second, a lot of the internal functionality of a database system depends on sorting data or having sorted data available. B-tree indexes are an obvious example. BRIN indexes have knowledge of order. Range partitioning has to compare values. Merge joins depend on sorted input. The idea that is common to these different techniques is that, roughly speaking, if you have sorted data and you know what you are looking for, it makes it much faster to locate the place where it should be found.

There are two important aspects to sorting. One is the sorting algorithm. This is a standard topic in computer science, and a lot of work has gone into PostgreSQL over the years to refine the various sorting algorithms and methods, but that’s not what I will write about. The other is deciding in what order things should be, which is what we call collation. In many cases, that choice is obvious. 1 comes before 2. FALSE comes before TRUE … well, someone just arbitrarily decided that one. A usually comes before B. But when it comes to natural language text, things get interesting. There are many different ways to order text, and the actual methods to collate text strings are more complicated than might be apparent. Different languages prefer different sort orders, but even within a language, there can be variations for different applications. And there are details to worry about, such as what to do about whitespace, punctuation, case differences, diacritic marks, and so on. Look up the Unicode Collation Algorithm for more insight into this.

Before the ICU feature was committed, all this functionality what facilitated by the C library in the operating system. PostgreSQL basically just passes strings to strcmp(), strcoll(), and the like and works with the result. The C libraries in the various operating systems implement the various collation variants and nuances mentioned above to different levels of functionality and quality, so PostgreSQL can do what your operating system can do.

Changing collations

Problems start if the operating system ever needs to change a collation it provides. Why would they want to do that? It could be that the previous collation was wrong and had to be fixed. Maybe a new standard for a language was published and the collation is to be updated for that. Maybe the internal representation of collation and string data was changed for performance reasons or because it was necessary to implement additional functionality. For many programs, this is not an issue. You might just see a slightly differently ordered output, if you notice a difference at all. For a database system, however, this is a major problem. As described above, PostgreSQL stores sorted data in indexes and other places and relies on the sort order to be correct. If the sort order is not correct, an index lookup might not find data that is actually there. Or a write to an index will write to a different place. Or data is written to or read from the wrong partition. This can lead to erroneously duplicate data or the appearance of data loss because data is not where it is looked for. In other words, it can lead to data corruption and (apparent) data loss.

Unfortunately, there was not much we could do about it so far. Operating systems update their collations whenever they feel like it, perhaps as part of an upgrade to their C library package. There is no way to find out about this in a reasonable way, or than perhaps by inspecting the update packages in detail. And even then, will you reject an important update of your C library because you noticed that the collation in some locale you are not using was changed? It was a very uncomfortable situation.

Enter ICU

So where does ICU come in? ICU, International Components for Unicode, is a library that provides internationalization and localization facilities, including collation. So in that respect, it is an alternative to using the facilities in the standard C library. The nice thing is that ICU explicitly provides some guarantees about the stability of collations:

  • A collation will not be changed in an incompatible way as part of a minor release update.
  • A collation has a version, which can be inspected, and when a collation changes in an incompatible way, the version changes.

For users of PostgreSQL, this will mean in practice:

  • Routine operating system package updates will not interfere with the validity of sorted data. Since a postgres binary is linked to a particular major version of libicu, routine operating system package upgrades will not end up with postgres being linked to a new major version of libicu, as long as a) you don’t update the PostgreSQL packages, or b) the PostgreSQL packages are still linked to the same major version of ICU as before. Packagers will need to be careful to maintain this properly, but that shouldn’t be too problematic in practice.
  • When major package and operating system upgrades do change the version of a collation, we have a way to detect that and warn the user. Right now we just warn and offer some guidelines and tools to fix things, but in the future we might refine and automate this further.

(To make this more explicit for packagers: In a stable branch of your operating system, you should not change the major ICU version that a given PostgreSQL package set is linked with.)

Using ICU

To be able to use this, PostgreSQL needs to be built explicitly with ICU support. When building from source, use ./configure --with-icu along with other desired options. We expect most major binary packages to offer this by default as well. When this is done, ICU-based collations are offered alongside the libc-based collations that previous releases offered. (So building with ICU support does not remove libc collation support; the two exist together.) Check the documentation for details on how to select an ICU-based collation versus a libc-based one. For example, if you had previously specified

CREATE TABLE ... (... x text COLLATE "en_US" ...)

you might now do

CREATE TABLE ... (... x text COLLATE "en-x-icu" ...)

This should give you roughly the same user-visible behavior as before, except that your database will be more future-proof when it comes to upgrading. (On Linux/glibc, the sort order should be mostly the same, but there could be small differences in some details. If, however, you are using an operating system whose C library does not support Unicode collation at all, such as macOS or older versions of FreeBSD, then this will be a major change — for the better.)

Currently, ICU support is only available for explicitly specified collations. The default collation in a database is still always provided by the C library. Addressing this is a future project.

If you upgrade such a database by pg_upgrade for example to a new PostgreSQL installation that is linked with a newer major version of ICU that has changed the collation version of that collation you are using, then you will get a warning and will have to fix up for example any indexes that depend on the collation. Instructions for this are also in the documentation.

Abbreviated keys

So this change will provide some very important improvements for long-term robustness of a database system. But ICU is also an improvement over the system C library in other areas.

For example, PostgreSQL B-trees can store what are called abbreviated keys to improve performance and storage. For text string data types, with the standard C library, we would compute these abbreviated keys using the strxfrm() function. However, we have learned that many C libraries have a variety of bugs and misbehaviors that make this approach not reliable. So the abbreviated keys optimization is currently disabled for string data types. With ICU, we can use the equivalent API calls and compute abbreviated keys in what we believe is a reliable and stable way. So there are possible performance improvements from this move as well.

More collations

Apart from these internal improvements of robustness and performance, there is also some new user-facing functionality.

For some languages, more than one sort order might be relevant in practice. (This might get you started.) One example is that for German, there is a standard sort order that is used for most purposes and a “phone book” sort order that is used for lists of names. The standard C library only provides one of those variants (probably the first one). But if you want to write an application that properly sorts, say, both product names and customer names, you need to be able to use both.

For example, the example from the German Wikipedia can now be reproduced with PostgreSQL:

CREATE TABLE names (name text);

INSERT INTO names
    VALUES ('Göbel'), ('Goethe'), ('Goldmann'), ('Göthe'), ('Götz');

=> SELECT name FROM names ORDER BY name COLLATE "de-u-co-standard-x-icu";
   name
----------
 Göbel
 Goethe
 Goldmann
 Göthe
 Götz

=> SELECT name FROM names ORDER BY name COLLATE "de-u-co-phonebk-x-icu";
   name
----------
 Göbel
 Goethe
 Göthe
 Götz
 Goldmann

=> SELECT name FROM names ORDER BY name COLLATE "de-AT-u-co-phonebk-x-icu";
   name
----------
 Goethe
 Goldmann
 Göbel
 Göthe
 Götz

(With glibc, COLLATE "de_DE" and COLLATE "de_AT" indeed return the first order.)

One interesting way to combine several features might be to use domains to model the above mentioned difference between product names and customer names:

CREATE DOMAIN product_name AS text COLLATE "de-u-co-standard-x-icu";
CREATE DOMAIN person_name AS text COLLATE "de-u-co-phonebk-x-icu";

(This is just an example. Of course you can also attach those COLLATE clauses to column definitions directly or use them in queries.)

Even more collations

Finally, and this is clearly what the world had been waiting for, there is now a way to properly sort emojis. This is essential to ensure that all your cat faces are in the proper order. Compare

=# SELECT chr(x) FROM generate_series(x'1F634'::int, x'1F644'::int) AS _(x)
       ORDER BY chr(x) COLLATE "und-x-icu";
 chr
-----
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?

with

=# SELECT chr(x) FROM generate_series(x'1F634'::int, x'1F644'::int) AS _(x)
       ORDER BY chr(x) COLLATE "und-u-co-emoji-x-icu";
 chr
-----
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?
 ?

Yes, there is actually a standard about this.

More to come

This is just the beginning. ICU offers a lot of functionality in this area that we are not exposing through PostgreSQL yet. There are options for case-insensitive sorting, accent-insensitive sorting, and totally customizing a collation. Look for those in future PostgreSQL releases.

9 Comments

  • Proper collation support, independent of O.S is really a nice addition to postgresql.
    This will make easier to port from databases that have different collations.

    I have one Sybase ASE and one SqlServer with collation LATIN1_CI_AI (case insensitive, accent insensitive).

    Is there an estimative when case insensitive and accent insensitive will be supported?

    Keep up the good work.
    Regards,
    lagar84

  • The requirement of tying major versions of ICU and PG (distribution-wide or for a given install) is tricky. For example Gentoo offers multiple versions of ICU, and forcing PG 10.* to depend on icu 58.1 (for example) would be very user-unfriendly. An update to icu will trigger a suggestion to recompile PG, potentially messing things up. Also, what happens when using streaming or logical replication between hosts that differ in icu version ?

    I know this was already a problem before and that switching to icu is a definite improvement, but I hope that warnings about incompatible collations are very hard to miss.

    • Peter Eisentraut

      Well, we need a way to get stable collation data. We can either get that by using libc and never updating it or by using ICU and updating it with certain constraints. What you’re describing just isn’t going to work with currently available tools, unless some new library comes out that maintains previous collation data indefinitely.

      Streaming replication between incompatible collations will have the same issues as described for pg_upgrade. Logical decoding is not a problem because the data will be inserted in the appropriate collation order at the receiving end.

  • I love that you chose to use emoticons (emojis) for the collation example — it’s ideal because it provides an excellent example of the different sorting results.

    Please don’t forget about the number 42 being the statistical constant for “the answer to life, the universe, and everything” (for those who don’t understand this reference, look up the letters D.N.A., which are the initials of Douglas Noel Adams), which could come in handy for use in future examples should your need arise. (Don’t Panic!)

    On a more serious note, it makes sense to me that ICU should be handled by PostgreSQL, particularly since each Operating System probably handles it differently, especially since you pointed out the potential for cross-platform inconsistencies from the strcmp() and strcoll() C functions.

    To me, this serious regard for stability and consistency (which seems to be a long-standing attitude among PostgreSQL developers) is another excellent reason for choosing PostgreSQL.

    • Peter Eisentraut

      It is correct with the hyphen, because the new ICU collations use a naming scheme based on BCP 47, which is a newish standard for naming locales.

Leave a Reply

Your email address will not be published. Required fields are marked *