Friday, June 22

Using Window Functions for Time Series IoT Analytics in Postgres-BDR

Internet of Things tends to generate large volumes of data at a great velocity. Often times this data is collected from geographically distributed sources and aggregated at a central location for data scientists to perform their magic i.e. find patterns, trends and make predictions.

Let’s explore what the IoT Solution using Postgres-BDR has to offer for Data Analytics. Postgres-BDR is offered as an extension on top of PostgreSQL 10 and above. It is not a fork. Therefore, we get the power of complete set of analytic functions that PostgreSQL has to offer. In particular, I am going to play around with PostgreSQL’s Window Functions here to analyze a sample of time series data from temperature sensors.

Let’s take an example of IoT temperature sensor time series data spread over a period of 7 days. In a typical scenario, temperature sensors are sending readings every minute. Some cases could be even more frequent. For the sake of simplicity, however, I am going to use one reading per day. The objective is to use PostgreSQL’s Window Functions for running analytics and that would not change with increasing the frequency of data points.

Here’s our sample data. Again, the number of table fields in a real world IoT temperature sensor would be higher, but our fields of interest in this case are restricted to timestamp of the temperature recording, device that reported it and the actual reading.

CREATE TABLE iot_temperature_sensor_data (
    ts timestamp without time zone,
    device_id text,
    reading float
);

and we add some random timeseries temperature sensor reading data for seven consecutive days

         ts          |            device_id             | reading 

---------------------+----------------------------------+---------

 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10

 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9

 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1

 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3

 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6

 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9

 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9

Some of the very obvious questions of course:

  • What was the lowest temperature reading and when?
  • What was the highest temperature reading and when?

Quite understandably, we can run Min() and MAX() on ‘reading’ column to get lowest and highest temperature readings:

SELECT
    MIN(reading),
    MAX(reading)
FROM
    iot_temperature_sensor_data;
 min | max 
-----+-----  
   1 |  10 
(1 row)

While it is useful, it doesn’t tell us which corresponding dates and/or devices reported lowest and highest temperatures.

SELECT
    ts,
    device_id,
    reading
FROM
    iot_temperature_sensor_data
WHERE
    reading = (
        SELECT
            MIN(reading)
        FROM
            iot_temperature_sensor_data);
          ts          |            device_id             | reading 
 ---------------------+----------------------------------+--------- 
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1 (1 row)

So far so good!

However what about the questions like:
– Where does a given day rank in terms of temperature reading for the given week in a specified order?
– Let’s take a look to see how low or high the temperature was compared to previous or following day. So essentially we are looking to find the delta in temperature readings for a given day compared to one previous and/or following day.

This is where Window Functions come into play allowing us to compare and contrast values in relation to current row.

So if we wanted to find where a given day ranks in terms of its temperature reading with lowest temperature ranked at the top:

SELECT
    ts,
    device_id,
    reading,
    rank()
    OVER (
    ORDER BY
        reading)
FROM
    iot_temperature_sensor_data;
          ts          |            device_id             | reading | rank
 ---------------------+----------------------------------+---------+------
  2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1 |    1
  2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3 |    2
  2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6 |    3
  2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9 |    4
  2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9 |    4
  2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9 |    4
  2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10 |    7
 (7 rows) 

If we looked at the rank column, it looks good except that we notice rank 7 right after 4. This is because the next rank gets skipped because we have three rows with identical temperature reading of 9. What if I wanted not to skip ? PostgreSQL provides rank_dense() exactly to serve the same purpose.

SELECT
    ts,
    device_id,
    reading,
    dense_rank()
    OVER (
    ORDER BY
        reading)
FROM
    iot_temperature_sensor_data;
         ts          |            device_id             | reading | dense_rank
---------------------+----------------------------------+---------+------------
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1 |          1 
 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3 |          2 
 2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6 |          3 
 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9 |          4 
 2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9 |          4 
 2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9 |          4 
 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10 |          5 
(7 rows)

How about which days saw the maximum rise in temperature compared to previous day ?

SELECT
    ts,
    device_id,
    reading,
    reading - lag(reading, 1)
    OVER (
    ORDER BY
        reading) AS diff
FROM
    iot_temperature_sensor_data;

         ts          |            device_id             | reading | diff
---------------------+----------------------------------+---------+------
  2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1 |
  2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3 |    2
  2017-06-05 00:00:00 | d94d599467d1d8d3d347d9e9df809691 |       6 |    3
  2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9 |    3
  2017-06-06 00:00:00 | a9c1e8f60f28935f3510d7d83ba54329 |       9 |    0
  2017-06-07 00:00:00 | 6fb333bd151b4bcc684d21b248c06ca3 |       9 |    0
  2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10 |    1
 (7 rows)

 We see that the last column shows the rise in temperature vs previous day. A quick visual inspection would show that the maximum rise in temperature is 3 degrees on 2017-06-02 00:00:00 and then again on 2017-06-05 00:00:00. With a little bit of CTE magic, we could list the days that saw the maximum rise in temperature.

WITH temperature_data AS (
    SELECT
        ts,
        device_id,
        reading,
        reading - lag(reading, 1)
        OVER (
        ORDER BY
            reading) AS diff
    FROM
        iot_temperature_sensor_data
)
SELECT
    ts,
    diff
FROM
    temperature_data
WHERE
    diff = (
        SELECT
            MAX(diff)
        FROM
            temperature_data);
         ts          | diff 

---------------------+------

 2017-06-05 00:00:00 |    3

 2017-06-02 00:00:00 |    3

Here is a list of Window Functions that PostgreSQL and Postgres-BDR support:

  • row_number()
  • percent_rank()
  • cume_dist()
  • ntile(num_buckets integer)
  • lead(value anyelement [, offset integer [, defaultanyelement ]])
  • first_value(value any)
  • last_value(value any)
  • nth_value(value any, nthinteger)

For further reading, please refer to the PostgreSQL documentation on Window Functions in PostgreSQL.

Leave a Reply

Your email address will not be published. Required fields are marked *