2ndQuadrant Ltd official blog

Mapreduce in Greenplum 4.1

| | Comments (0) | TrackBacks (0)

Mapreduce is a very trendy software framework. It has been introduced by Google (TM) in 2004. It is a large topic, and it is not possible to cover all of its aspetcs in a single blog article. This is a simple introduction to the mapreduce usage in Greenplum 4.1.

What is mapreduce exactly?

Mapreduce's main goal is to process highly distributable problems across huge datasets using a large number of computers (nodes). As you may understand, this suits perfectly with Greenplum, which is at ease with huge distributed datasets and allows the integration with the SQL language.

Mapreduce consists of two separate steps: Map and Reduce.

Map step

During this step, the main problem is partitioned into smaller sub-problems that are passed to children nodes, recursively. This process leads to a multi-level tree structure.

Reduce step

During this step, all the sub-problems solutions are merged to obtain the solution to the initial problem.

Install Mapreduce in Greenplum

Do you think installing Mapreduce in Greenplum is a difficult task? The answer is no. Mapreduce is already included in Greeplum!

Let's get practical

I assume that you have a Greenplum 4.1 system installed.

Run gpmapreduce --version:

$ gpmapreduce --version
gpmapreduce - Greenplum Map/Reduce Driver 1.00b2

Perfect, we can go on.

The main point here is a specially formatted file, that we will convetionally call it test.yml from now on.

As you may guess, that is a YAML file, which defines all parts that are needed by the mapreduce data flow to complete:

  • Input Data
  • Map Function
  • Reduce Function
  • Output Data

The Greenplum MapReduce specification file has a specific YAML schema. I invite you to have a look at the AdminGuide for details. In particular, MapReduce is handled in Chapter 23.

For the sake of this article, we will focus on function definitions.

Let's start writing a text file named test.yml with the mandatory header:

%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: dbname
USER: gpadmin
HOST: host

where and are the name of the database and the host where MapReduce will connect to.

Input Data

Input Data can be obtained in so many ways, in this example we will use an SQL SELECT statement. Let's create a table in database to get data from:

$ psql -c "CREATE TABLE mydata AS SELECT i AS x,
     floor(random()*100) AS y FROM generate_series(1,5) i" 

This will create a 5 rows table with this structure:

=# d mydata
         Table "public.mydata"
 Column |       Type       | Modifiers 
--------+------------------+-----------
 x      | integer          | 
 y      | double precision | 
Distributed by: (x)

The set of rows of this table is our Input Data. Let's define it in the MapReduce configuration file, by appending this to test.yml:

DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM mydata

That is self-explanatory, it just selects all rows from the mydata table as input data for mapreduce.

Map Function

It is very important to understand that a Map function takes as input a single row, and produces zero or more output rows. Map functions can be written in C, Perl or Python. They reside directly in the YAML configuration file.

Parameters managment varies between programming languages (please consult AdminGuide for details). Let's see an example of a map function written in Python. You can append the following to test.yaml:

- MAP:
        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [x integer, y float]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }

As you can see, function source is placed directly in the YAML configuration file. The function takes x and y as input and returns (yield) x and the sum of x and y.

The Reduce step

Reduce functions takes a set of rows in input and produces a single reduced row. There are several predefined functions included in Greenplum.

Here's the list:

  • IDENTITY - returns (key, value) pairs unchanged
  • SUM - calculates the sum of numeric data
  • AVG - calculates the average of numeric data
  • COUNT - calculates the count of input data
  • MIN - calculates minimum value of numeric data
  • MAX - calculates maximum value of numeric data

Let's apply a REDUCE function to our input data, so append this at test.yml:

EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM

This return values unchanged. It is not very useful practically, but it is enough to show the Reduce step in action and get you started.

Ok, let's see the complete test.yml:

%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM my_data

    - MAP:

        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [ x integer , y float ]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }

EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM

Remember that YAML does not use TABS!

It is now possible to execute this Mapreduce job simply running:

$ gpmapreduce -f test.yaml

Results here will most likely be different from yours, due to the usage of the random() function during data generation. Here's mine:

mapreduce_2508_run_1
key     |value
--------+-----
Sum of x|   15
Sum of y|  278
(2 rows)

Exactly the sum of all x and y values from input table mydata.

In conclusion, this is just a smattering of how MapReduce works in Greenplum. MapReduce is a complex and wide topic, and its usage is growing in popularity every day.

Greenplum has an excellent support of it and allows business analytics users to take advantage of the shared nothing architecture by executing map/reduce functions in a distributed way and by working on distributed datasets.

0 TrackBacks

Listed below are links to blogs that reference this entry: Mapreduce in Greenplum 4.1.

TrackBack URL for this entry: http://blog.2ndquadrant.it/mt/mt-tb.cgi/179

Leave a comment