Wednesday, October 24

XMLTABLE in PostgreSQL

I just committed a patch by Pavel Stěhule that adds the XMLTABLE functionality to PostgreSQL 10.  XMLTABLE is a very useful feature dictated by the SQL/XML standard, that lets you turn your XML data into relational form, so that you can mix it with the rest of your relational data. This feature has many uses; keep reading for some details on it.

Probably the most interesting use case of XMLTABLE is to extract data from some XML document to insert into a relational table, during in-database ETL processing. However, XMLTABLE can be used on-the-fly on data stored in XML columns, so that once the data is in relational form, you can apply any standard operations you want, such as adding WHERE clauses, doing aggregations, joining to other tables, and so on.

A Simple Example

As an example, let’s suppose you administer a hotel chain, and that the data is stored thusly:

CREATE TABLE hoteldata AS SELECT xml
$$<hotels>
 <hotel id="mancha">
  <name>La Mancha</name>
  <rooms>
   <room id="201"><capacity>3</capacity><comment>Great view of the Channel</comment></room>
   <room id="202"><capacity>5</capacity></room>
  </rooms>
  <personnel>
   <person id="1025">
    <name>Ferdinando Quijana</name><salary currency="PTA">45000</salary>
   </person>
  </personnel>
 </hotel>
  <hotel id="valpo">
  <name>Valparaíso</name>
  <rooms>
   <room id="201"><capacity>2</capacity><comment>Very noisy</comment></room>
   <room id="202"><capacity>2</capacity></room>
  </rooms>
  <personnel>
   <person id="1026"><name>Katharina Wuntz</name><salary currency="EUR">50000</salary></person>
   <person id="1027"><name>Diego Velázquez</name><salary currency="CLP">1200000</salary></person>
  </personnel>
 </hotel>
</hotels>$$ AS hotels;

With XMLTABLE, you can turn this into a relationally-formatted table consisting of room numbers and capacity, annotating for each hotel in your chain:

SELECT xmltable.*
  FROM hoteldata,
       XMLTABLE ('/hotels/hotel/rooms/room' PASSING hotels
                 COLUMNS
                    id FOR ORDINALITY,
                    hotel_name text PATH '../../name' NOT NULL,
                    room_id int PATH '@id' NOT NULL,
                    capacity int,
                    comment text PATH 'comment' DEFAULT 'A regular room'
                );
id hotel_name room_id capacity comment
1 La Mancha 201 3 Great view of the Channel
2 La Mancha 202 5 A regular room
3 Valparaíso 201 2 Very noisy
4 Valparaíso 202 2 A regular room

Explaining the syntax

Let’s study the query above. The XMLTABLE clause must go in the FROM part of the query. We also have hoteldata in the FROM, which is what feeds the data into XMLTABLE.

First, the PASSING clause is where we specify the XML data that we want to process. In this case, the data comes from the hotels column in the hoteldata table. We call this the document expression.

Just before the PASSING clause you see an XPath expression '/hotels/hotel/rooms/room'. We call this the row-generating expression or just the row expression.

We have the COLUMNS clause next, declaring a few columns. For each column we indicate a data type as well as an optional PATH clause, which we call the column expression.

XMLTABLE‘s theory of operation is that the row expression is applied to the document expression, slicing the document into pieces to generate rows; for each row so generated, the various column expressions are applied to obtain the values for each column.

The column expression is an XPath expression that obtains a value starting from the XML for the current row. If no PATH is specified, then the column name itself is used as the XPath expression. Note that in the column hotel_name we used a path having “../“, which means to “go up” in the XML document to grab values from the the “container” objects in the document. We can also use xml PATH '.' in a row, which gives us the full source XML for that row.

One column can be marked FOR ORDINALITY. The column is then of type INTEGER, and is numbered sequentially for each row obtained from the document. (If there are multiple input documents, such as when you have multiple rows in a table, the counter starts from 1 for each new document).

There is also a DEFAULT clause. If the XPath for a column does not match a value for a certain row, then the DEFAULT value is used.

Some of these columns have been marked as NOT NULL. If there is no match, and no DEFAULT clause is specified (or the DEFAULT also evaluates to NULL), an error is thrown.

I won’t go into more detail of XPath, which is a powerful language, but I can offer the XPath article in Wikipedia, and the official recommendation document from W3C as useful resources.

The full XMLTABLE syntax

The documented syntax synopsis is:

xmltable( [XMLNAMESPACES(namespace uri AS namespace name[, ...])] row_expression PASSING [BY REF] document_expression [BY REF] COLUMNS name { type [PATH column_expression] [DEFAULT expr] [NOT NULL | NULL] | FOR ORDINALITY } [, ...] )

Note that the document expression can be a reference to some a table you have in the FROM clause, or it can be a complete XML document as a string literal. The BY REF clauses don’t have any effect; they are there for compatibility with the standard and with other database systems.

I have not covered the XMLNAMESPACES clause in this post; I’m leaving that for a future installment.

Applying SQL on top

As mentioned, once XMLTABLE has processed the data into relational form, you can do whatever you want using well-known tools. For instance, if you had another XML document with more personnel in each hotel,

INSERT INTO hoteldata VALUES (xml
$$<hotels> 
 <hotel id="mancha">
  <name>La Mancha</name>
  <personnel>
   <person id="1028">
    <name>Sancho Panza</name><salary currency="PTA">35000</salary>
   </person>
  </personnel>
 </hotel>
 <hotel id="valpo">
  <name>Valparaíso</name>
  <personnel>
   <person id="1029"><name>Kurt Werner</name><salary currency="EUR">30000</salary></person>
  </personnel>
 </hotel>
</hotels>$$);

It’s easy to obtain the total salaries for each currency you need to pay at each hotel,

  SELECT hotel, currency, sum(salary)
    FROM hoteldata,
XMLTABLE ('/hotels/hotel/personnel/person' PASSING hotels
       COLUMNS hotel text PATH '../../name' NOT NULL,
               salary integer PATH 'salary' NOT NULL,
               currency text PATH 'salary/@currency' NOT NULL
) GROUP BY hotel, currency;
hotel currency sum
Valparaíso CLP 1200000
Valparaíso EUR 80000
La Mancha PTA 80000

Conclusion

In this article, I have covered the new feature XMLTABLE to appear in PostgreSQL version 10. I think XMLTABLE is a great feature for integrating external data, and I hope you find it valuable too. Please test it and report any problems, so that we can sort them out before the final release. If you like XMLTABLE, be sure to let us know leaving a comment!

13 Comments

    • alvherre

      Yes, thanks!

      One thing I noticed in your patch is that multiple FOR ORDINALITY columns are allowed, but this framework only supports one (because that’s what the standard requires). Does JSON_TABLE require more than one?

  • Goo

    COLUMNS
    dn text PATH ‘MO/DN’,

    ERROR: more than one value returned by column XPath expression

    Contains more than one, Can you choose the first one?

    • alvherre

      Sure, there’s XPath syntax for array indexes. As I recall, it would be `PATH ‘MO/DN[1]’`, or maybe it could be `PATH ‘MO[1]/DN’`, depending on where the array is. I don’t know what a good tutorial for XPath is, but the wikipedia page seems like a decent starting point: https://en.wikipedia.org/wiki/XPath

  • puqun

    when i use xmltable cross join xmltable,
    As a result of the following examples, I am not sure whether this result is correct or not.
    ex:
    CREATE TABLE xmltest AS SELECT
    xml $$

    $$ AS data;

    SELECT xmltable.*
    FROM xmltest,
    XMLTABLE (XMLNAMESPACES(‘http://example.com/myns’ AS x,
    ‘http://example.com/b’ AS “B”),
    ‘/x:example/x:item’
    PASSING by ref(SELECT data FROM xmltest)
    COLUMNS
    id FOR ORDINALITY,
    foo int PATH ‘@foo’,
    bar int PATH ‘@B:bar’,
    name text DEFAULT ‘not specified’) CROSS JOIN
    XMLTABLE (XMLNAMESPACES(‘http://example.com/myns’ AS x,
    ‘http://example.com/b’ AS “B”),
    ‘/x:example/x:item’
    PASSING by ref(SELECT data FROM xmltest)
    COLUMNS
    id FOR ORDINALITY,
    foo int PATH ‘@foo’,
    bar int PATH ‘@B:bar’,
    name text DEFAULT ‘not specified’)t2;

    The result1:
    id | foo | bar | name
    —-+—–+—–+—————
    1 | 100 | 2 | not specified
    1 | 100 | 2 | not specified
    1 | 100 | 2 | not specified
    1 | 100 | 2 | not specified
    2 | 300 | 4 | not specified
    2 | 300 | 4 | not specified
    2 | 300 | 4 | not specified
    2 | 300 | 4 | not specified
    3 | 400 | 5 | not specified
    3 | 400 | 5 | not specified
    3 | 400 | 5 | not specified
    3 | 400 | 5 | not specified
    4 | 300 | 5 | not specified
    4 | 300 | 5 | not specified
    4 | 300 | 5 | not specified
    4 | 300 | 5 | not specified
    (16 rows)

    But, when I cross join two common tables of the same data as xmltable ,the result2:
    id | foo | bar | name | id | foo | bar | name
    —-+—–+—–+—————+—-+—–+—–+—————
    1 | 100 | 2 | not specified | 1 | 100 | 2 | not specified
    1 | 100 | 2 | not specified | 2 | 300 | 4 | not specified
    1 | 100 | 2 | not specified | 3 | 400 | 5 | not specified
    1 | 100 | 2 | not specified | 4 | 300 | 5 | not specified
    2 | 300 | 4 | not specified | 1 | 100 | 2 | not specified
    2 | 300 | 4 | not specified | 2 | 300 | 4 | not specified
    2 | 300 | 4 | not specified | 3 | 400 | 5 | not specified
    2 | 300 | 4 | not specified | 4 | 300 | 5 | not specified
    3 | 400 | 5 | not specified | 1 | 100 | 2 | not specified
    3 | 400 | 5 | not specified | 2 | 300 | 4 | not specified
    3 | 400 | 5 | not specified | 3 | 400 | 5 | not specified
    3 | 400 | 5 | not specified | 4 | 300 | 5 | not specified

    So, I want to know,
    The result1 is right or not?

    • alvherre

      Sorry, I didn’t understand your XML very well — apparently all the XML markup has been removed by wordpress. Maybe you can post the question elsewhere, say [email protected] or StackOverflow? If you then drop a link here I can try to reply. (Or maybe in WordPress it is possible to markup XML properly for a comment? I don’t know.) Thanks.

  • David Day

    I was hoping to get some advise and potentially a solution to my problem.

    This example is just showing the concept of what we’re trying to do as the XML structure we hold can be over 2,000 lines.

    In Oracle we’re migrating our code into postgres so this is how we’re currently doing it in Oracle and what we’re trying to migrate into postgres. One of the problems we’re hitting is using MULTIPLE XMLTABLE functions to link nodes with a single XML tag elements to one’s with multiple tag elements as show in this example. The problem with the XMLTABLE in postgres we don’t seem to be able to pass structures from one XMLTABLE to another to structure the output as we require.

    Here is a test case example:-

    CREATE TABLE XML_TABLE (
    id NUMERIC(19,0) NOT NULL,
    resultxml XML
    );

    INSERT into XML_TABLE(ID,RESULTXML)
    VALUES
    (1,’

    AA
    XX

    test item 1
    29.9

    test item 2
    30.9

    BB
    ZZ

    test item 1
    30.9

    test item 2
    40.9

    CC
    YY

    test item 1
    50.9

    test item 2
    6-.9

    ‘);

    SELECT xt.id ,
    xt1.RESULT_POS,
    xt1.product,
    xt1.name,
    xt2.item_pos,
    xt2.item_text,
    xt2.item_value
    FROM TEST_LOAD.XML_TABLE xt,
    XMLTABLE(‘//storedresults/result’
    PASSING xt.resultxml
    COLUMNS RESULT_POS FOR ORDINALITY,
    PRODUCT CHARACTER VARYING(20) path ‘product’,
    NAME CHARACTER VARYING(20) path ‘name’,
    ITEMS_XML XML PATH ‘//items/item’) xt1,
    XMLTABLE(‘item’
    PASSING xt1.ITEMS_XML
    COLUMNS
    ITEM_POS FOR ORDINALITY,
    ITEM_TEXT CHARACTER VARYING(300) PATH ‘text’,
    ITEM_VALUE CHARACTER VARYING(300) PATH ‘value’) xt2
    WHERE xt.id = 1;

    I get the following error.

    ERROR: could not parse XML document
    DETAIL: line 4: Extra content at the end of the document

    In Oracle it works fine doing this so just wondering if there is another way of doing this in Postgres that is not going to be massive on performance consumption as our the LIVE solution is dealing with high volumes of data which extract data from a large XML structure. We have over 20 extracts from the same XML structure for different components. The postion element is key to make our record unique for a combination of fields we use for the primary key. This is to support CDC from the extract data tables we’ve created and keep uploading new data into.

    Thanks
    David

  • Ivan Ustûžanin

    For those who like me stumbled on this post googling about XMLTABLE throwing “unexpected XPath object type 3” (or “type 2” for the matter): the gist is that XMLTABLE *does NOT* handle any XPath types except nodesets and strings (I’ve looked at the source code). Use XPath string() function to work around this issue. By the way type 2 is a boolean value and type 3 is a number.

    • Ivan Ustûžanin

      I also noticed that despite a column having an array type a nodeset returned for column containing more that one node also throws the error “Cardinality violation: 7 ERROR: the XPath expression for a column returned more than one value” (sorry, it’s a back-translation from Russian).

  • Anand

    Hi,
    Thanks for explanation. We have an application which is hosted on Postgres release 9.6.4 and we need XMLTable & XMLNamespaces functionality to process big xml files to flush data into tables.

    Is it possible back-port these features into old release? Can I take underlying source code and deploy the code into 9.6.3 release?

    Thanks for your response.

Leave a Reply

Your email address will not be published. Required fields are marked *