More robust collations with ICU support in PostgreSQL 10

May 22, 2017/20 Comments/in Eisentraut's PlanetPostgreSQL, PostgreSQL /by Peter Eisentraut

In this article, I want to introduce the ICU support in PostgreSQL, which I have worked on for PostgreSQL version 10, to appear later this year.

Sorting

Sorting is an important functionality of a database system. First, users generally want to see data sorted. Any query result that contains more than one row and is destined for end-user consumption will probably want to be sorted, just for a better user experience. Second, a lot of the internal functionality of a database system depends on sorting data or having sorted data available. B-tree indexes are an obvious example. BRIN indexes have knowledge of order. Range partitioning has to compare values. Merge joins depend on sorted input. The idea that is common to these different techniques is that, roughly speaking, if you have sorted data and you know what you are looking for, it makes it much faster to locate the place where it should be found.

There are two important aspects to sorting. One is the sorting algorithm. This is a standard topic in computer science, and a lot of work has gone into PostgreSQL over the years to refine the various sorting algorithms and methods, but that’s not what I will write about. The other is deciding in what order things should be, which is what we call collation. In many cases, that choice is obvious. 1 comes before 2. FALSE comes before TRUE … well, someone just arbitrarily decided that one. A usually comes before B. But when it comes to natural language text, things get interesting. There are many different ways to order text, and the actual methods to collate text strings are more complicated than might be apparent. Different languages prefer different sort orders, but even within a language, there can be variations for different applications. And there are details to worry about, such as what to do about whitespace, punctuation, case differences, diacritic marks, and so on. Look up the Unicode Collation Algorithm for more insight into this.

Before the ICU feature was committed, all this functionality what facilitated by the C library in the operating system. PostgreSQL basically just passes strings to strcmp(), strcoll(), and the like and works with the result. The C libraries in the various operating systems implement the various collation variants and nuances mentioned above to different levels of functionality and quality, so PostgreSQL can do what your operating system can do.

Changing collations

Problems start if the operating system ever needs to change a collation it provides. Why would they want to do that? It could be that the previous collation was wrong and had to be fixed. Maybe a new standard for a language was published and the collation is to be updated for that. Maybe the internal representation of collation and string data was changed for performance reasons or because it was necessary to implement additional functionality. For many programs, this is not an issue. You might just see a slightly differently ordered output, if you notice a difference at all. For a database system, however, this is a major problem. As described above, PostgreSQL stores sorted data in indexes and other places and relies on the sort order to be correct. If the sort order is not correct, an index lookup might not find data that is actually there. Or a write to an index will write to a different place. Or data is written to or read from the wrong partition. This can lead to erroneously duplicate data or the appearance of data loss because data is not where it is looked for. In other words, it can lead to data corruption and (apparent) data loss.

Unfortunately, there was not much we could do about it so far. Operating systems update their collations whenever they feel like it, perhaps as part of an upgrade to their C library package. There is no way to find out about this in a reasonable way, or than perhaps by inspecting the update packages in detail. And even then, will you reject an important update of your C library because you noticed that the collation in some locale you are not using was changed? It was a very uncomfortable situation.

Enter ICU

So where does ICU come in? ICU, International Components for Unicode, is a library that provides internationalization and localization facilities, including collation. So in that respect, it is an alternative to using the facilities in the standard C library. The nice thing is that ICU explicitly provides some guarantees about the stability of collations:

A collation will not be changed in an incompatible way as part of a minor release update.
A collation has a version, which can be inspected, and when a collation changes in an incompatible way, the version changes.

For users of PostgreSQL, this will mean in practice:

Routine operating system package updates will not interfere with the validity of sorted data. Since a postgres binary is linked to a particular major version of libicu, routine operating system package upgrades will not end up with postgres being linked to a new major version of libicu, as long as a) you don’t update the PostgreSQL packages, or b) the PostgreSQL packages are still linked to the same major version of ICU as before. Packagers will need to be careful to maintain this properly, but that shouldn’t be too problematic in practice.
When major package and operating system upgrades do change the version of a collation, we have a way to detect that and warn the user. Right now we just warn and offer some guidelines and tools to fix things, but in the future we might refine and automate this further.

(To make this more explicit for packagers: In a stable branch of your operating system, you should not change the major ICU version that a given PostgreSQL package set is linked with.)

Using ICU

To be able to use this, PostgreSQL needs to be built explicitly with ICU support. When building from source, use ./configure --with-icu along with other desired options. We expect most major binary packages to offer this by default as well. When this is done, ICU-based collations are offered alongside the libc-based collations that previous releases offered. (So building with ICU support does not remove libc collation support; the two exist together.) Check the documentation for details on how to select an ICU-based collation versus a libc-based one. For example, if you had previously specified

CREATE TABLE ... (... x text COLLATE "en_US" ...)

you might now do

CREATE TABLE ... (... x text COLLATE "en-x-icu" ...)

This should give you roughly the same user-visible behavior as before, except that your database will be more future-proof when it comes to upgrading. (On Linux/glibc, the sort order should be mostly the same, but there could be small differences in some details. If, however, you are using an operating system whose C library does not support Unicode collation at all, such as macOS or older versions of FreeBSD, then this will be a major change — for the better.)

Currently, ICU support is only available for explicitly specified collations. The default collation in a database is still always provided by the C library. Addressing this is a future project.

If you upgrade such a database by pg_upgrade for example to a new PostgreSQL installation that is linked with a newer major version of ICU that has changed the collation version of that collation you are using, then you will get a warning and will have to fix up for example any indexes that depend on the collation. Instructions for this are also in the documentation.

Abbreviated keys

So this change will provide some very important improvements for long-term robustness of a database system. But ICU is also an improvement over the system C library in other areas.

For example, PostgreSQL B-trees can store what are called abbreviated keys to improve performance and storage. For text string data types, with the standard C library, we would compute these abbreviated keys using the strxfrm() function. However, we have learned that many C libraries have a variety of bugs and misbehaviors that make this approach not reliable. So the abbreviated keys optimization is currently disabled for string data types. With ICU, we can use the equivalent API calls and compute abbreviated keys in what we believe is a reliable and stable way. So there are possible performance improvements from this move as well.

More collations

Apart from these internal improvements of robustness and performance, there is also some new user-facing functionality.

For some languages, more than one sort order might be relevant in practice. (This might get you started.) One example is that for German, there is a standard sort order that is used for most purposes and a “phone book” sort order that is used for lists of names. The standard C library only provides one of those variants (probably the first one). But if you want to write an application that properly sorts, say, both product names and customer names, you need to be able to use both.

For example, the example from the German Wikipedia can now be reproduced with PostgreSQL:

CREATE TABLE names (name text);

INSERT INTO names
    VALUES ('Göbel'), ('Goethe'), ('Goldmann'), ('Göthe'), ('Götz');

=> SELECT name FROM names ORDER BY name COLLATE "de-u-co-standard-x-icu";
   name
----------
 Göbel
 Goethe
 Goldmann
 Göthe
 Götz

=> SELECT name FROM names ORDER BY name COLLATE "de-u-co-phonebk-x-icu";
   name
----------
 Göbel
 Goethe
 Göthe
 Götz
 Goldmann

=> SELECT name FROM names ORDER BY name COLLATE "de-AT-u-co-phonebk-x-icu";
   name
----------
 Goethe
 Goldmann
 Göbel
 Göthe
 Götz

(With glibc, COLLATE "de_DE" and COLLATE "de_AT" indeed return the first order.)

One interesting way to combine several features might be to use domains to model the above mentioned difference between product names and customer names:

CREATE DOMAIN product_name AS text COLLATE "de-u-co-standard-x-icu";
CREATE DOMAIN person_name AS text COLLATE "de-u-co-phonebk-x-icu";

(This is just an example. Of course you can also attach those COLLATE clauses to column definitions directly or use them in queries.)

Even more collations

Finally, and this is clearly what the world had been waiting for, there is now a way to properly sort emojis. This is essential to ensure that all your cat faces are in the proper order. Compare

=# SELECT chr(x) FROM generate_series(x'1F634'::int, x'1F644'::int) AS _(x)
       ORDER BY chr(x) COLLATE "und-x-icu";
 chr
-----
 😴
 😵
 😶
 😷
 😸
 😹
 😺
 😻
 😼
 😽
 😾
 😿
 🙀
 🙁
 🙂
 🙃
 🙄

with

=# CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');
=# SELECT chr(x) FROM generate_series(x'1F634'::int, x'1F644'::int) AS _(x)
       ORDER BY chr(x) COLLATE "und-u-co-emoji-x-icu";
 chr
-----
 🙂
 🙃
 😶
 🙄
 😴
 😷
 😵
 🙁
 😺
 😸
 😹
 😻
 😼
 😽
 🙀
 😿
 😾

Yes, there is actually a standard about this.

More to come

This is just the beginning. ICU offers a lot of functionality in this area that we are not exposing through PostgreSQL yet. There are options for case-insensitive sorting, accent-insensitive sorting, and totally customizing a collation. Look for those in future PostgreSQL releases.

20 replies

lagar84 says:
May 22, 2017 at 2:40 pm

Proper collation support, independent of O.S is really a nice addition to postgresql.
This will make easier to port from databases that have different collations.

I have one Sybase ASE and one SqlServer with collation LATIN1_CI_AI (case insensitive, accent insensitive).

Is there an estimative when case insensitive and accent insensitive will be supported?

Keep up the good work.
Regards,
lagar84
Reply
- Peter Eisentraut says:
  May 22, 2017 at 8:24 pm
  
  PostgreSQL 11 or 12 seem plausible.
  Reply
moltonel says:
May 22, 2017 at 3:21 pm

The requirement of tying major versions of ICU and PG (distribution-wide or for a given install) is tricky. For example Gentoo offers multiple versions of ICU, and forcing PG 10.* to depend on icu 58.1 (for example) would be very user-unfriendly. An update to icu will trigger a suggestion to recompile PG, potentially messing things up. Also, what happens when using streaming or logical replication between hosts that differ in icu version ?

I know this was already a problem before and that switching to icu is a definite improvement, but I hope that warnings about incompatible collations are very hard to miss.
Reply
- Peter Eisentraut says:
  May 22, 2017 at 8:29 pm
  
  Well, we need a way to get stable collation data. We can either get that by using libc and never updating it or by using ICU and updating it with certain constraints. What you’re describing just isn’t going to work with currently available tools, unless some new library comes out that maintains previous collation data indefinitely.
  
  Streaming replication between incompatible collations will have the same issues as described for pg_upgrade. Logical decoding is not a problem because the data will be inserted in the appropriate collation order at the receiving end.
  Reply
Randolf Richardson says:
May 29, 2017 at 3:38 am

I love that you chose to use emoticons (emojis) for the collation example — it’s ideal because it provides an excellent example of the different sorting results.

Please don’t forget about the number 42 being the statistical constant for “the answer to life, the universe, and everything” (for those who don’t understand this reference, look up the letters D.N.A., which are the initials of Douglas Noel Adams), which could come in handy for use in future examples should your need arise. (Don’t Panic!)

On a more serious note, it makes sense to me that ICU should be handled by PostgreSQL, particularly since each Operating System probably handles it differently, especially since you pointed out the potential for cross-platform inconsistencies from the strcmp() and strcoll() C functions.

To me, this serious regard for stability and consistency (which seems to be a long-standing attitude among PostgreSQL developers) is another excellent reason for choosing PostgreSQL.
Reply
Basil Bourque says:
July 24, 2017 at 11:02 pm

Shouldn’t that be “en_US-x-icu” rather than “en-x-icu” for consistency of your example? I’m looking at:
http://userguide.icu-project.org/locale
Reply
- Peter Eisentraut says:
  July 31, 2017 at 2:13 pm
  
  It is correct with the hyphen, because the new ICU collations use a naming scheme based on BCP 47, which is a newish standard for naming locales.
  Reply
Basil Bourque says:
August 13, 2017 at 8:39 pm

Can you provide a `CREATE DATABASE` example of creating a database that uses the ICU collations? I tried and failed. See [my Question on dba.StackExchange.com](https://dba.stackexchange.com/q/183355/19079).
Reply
- Peter Eisentraut says:
  August 14, 2017 at 3:51 pm
  
  As was answered on Stack Overflow, this is currently not supported.
  Reply
Brun says:
November 25, 2017 at 1:20 pm

Hello

When can i found collation “und-u-co-emoji” ?
Is it possible to add it ?
This collation seems to be missing in my distro (debian-9/ubuntu-16.04)

Best regards
Reply
- Peter Eisentraut says:
  November 28, 2017 at 8:53 pm
  
  The final implementation was changed since the blog post. You now need to create that collation yourself, like CREATE COLLATION xxx (provider = icu, locale = ‘und-u-co-emoji’);
  Reply
Jon says:
October 25, 2018 at 8:34 pm

PostgreSQL 11 is out now. Is it possible to configure a database to do case insensitive searches now?
Reply
axel says:
February 13, 2019 at 9:53 pm

ICU is an awesome feature. We’re trying to figure out how we can convert an existing database to default to an ICU collation (by creating a new DB) then exporting and reimporting from old DB to new. I guess the ability to create a DB with default collation being an ICU collation – is not possible yet. Any other suggestions for converting – or is the only choice leave the default collation as is and specific the collation at index creation and along with queries as well?
Reply
- craig.ringer says:
  May 16, 2019 at 1:33 am
  
  My understanding is that support for ICU at CREATE DATABASE time did not make it into PostgreSQL 10. You’d need to systematically alter your schema, specifying COLLATE clauses on the columns and in some cases like expression indexes, in expressions. I’m not sure there’s any sensible automatic way to do this at this point.
  
  Consider reaching out and seeing if you can help get database level ICU support into the next release.
  Reply
SQLpro says:
March 11, 2020 at 3:06 pm

ICU collations have some major defaults in PostGreSQL. As an example using a CI collation for french language resulting in error when using a LIKE operator !
Reply
- Peter Eisentraut says:
  March 24, 2020 at 11:17 am
  
  Yes, this restriction is documented here: https://www.postgresql.org/docs/12/collation.html (at the bottom of the page)
  Reply
  - James says:
    June 12, 2020 at 3:18 am
    
    Is there an estimative when case insensitive and accent insensitive will be supported?
    (It isn’t in the new feature s of pg 13.1 beta)
    thank you.
    Reply
    - Peter Eisentraut says:
      June 16, 2020 at 3:34 am
      
      Case insensitive and accent insensitive collations are supported in PostgreSQL 12.
      Reply
      - James says:
        June 16, 2020 at 6:49 am
        
        But It doesn’t support compare and like operator in where clause. Could you tell me which version will support case insensitive compare ?
        
        Thanks.
      - Peter Eisentraut says:
        June 16, 2020 at 7:49 am
        
        PostgreSQL 12 supports case-insensitive comparisons. The LIKE operator is not supported with case-insensitive collations, and there are no current plans to change that.