2ndQuadrant Ltd official blog

MapReduce in Greenplum 4.1 - 2nd part


In this article, we complete the MapReduce job started in the previous article.

Picking up the problem from the previous article

In the previous article, we ended up with this MapReduce configuration file:

%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM my_data

    - MAP:

        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [ x integer , y float ]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }

EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM

It produces the following output:

key     |value
--------+-----
Sum of x|   15
Sum of y|  278
(2 rows)

In plain words, that job sums all the values of two different columns of a test table.
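To make the data flow concrete, here is a minimal plain-Python sketch of what the map function and the SUM reducer do together. The rows are hypothetical (the real contents of my_data are not shown in this article); they were chosen only so that the totals match the output above. The reduce_sum helper is an illustration of the built-in SUM reducer, not Greenplum's implementation.

```python
from collections import defaultdict

# Hypothetical rows standing in for "SELECT x,y FROM my_data".
# They are invented for illustration and picked so that
# sum(x) = 15 and sum(y) = 278, as in the output above.
rows = [(1, 50.0), (2, 60.0), (3, 55.0), (4, 53.0), (5, 60.0)]


def my_map_function(x, y):
    # Same body as the FUNCTION block in test.yml:
    # every input row emits two (key, value) pairs.
    yield {'key': 'Sum of x', 'value': x}
    yield {'key': 'Sum of y', 'value': y}


def reduce_sum(pairs):
    # Illustrative stand-in for REDUCE: SUM, adding up values per key.
    totals = defaultdict(float)
    for pair in pairs:
        totals[pair['key']] += pair['value']
    return dict(totals)


mapped = (pair for x, y in rows for pair in my_map_function(x, y))
sums = reduce_sum(mapped)

for key, value in sorted(sums.items()):
    print(f"{key}|{value:g}")
```

Each row produces one pair per column, and the reducer collapses all pairs that share a key into a single total.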

Our goal here is to divide these two values, namely 15 by 278.

Let's check what the result is with a calculator, just to be sure that the MapReduce job will return the correct value:

$ psql -c "SELECT 15/278::FLOAT AS result" test_database
       result
--------------------
 0.0539568345323741
(1 row)

Yes, we use Greenplum as a calculator :).
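The ::FLOAT cast in that query matters. It applies to 278 only, which turns the operation into a floating-point division; dividing the two integers as they are would truncate the result to 0. The short Python sketch below mirrors the two behaviours, using // as a stand-in for SQL integer division (for these positive operands the two agree).

```python
import math

# Integer division, analogous to "SELECT 15/278" without the cast.
integer_result = 15 // 278

# Floating-point division, analogous to "SELECT 15/278::FLOAT".
float_result = 15 / float(278)

print(integer_result)           # 0
print(round(float_result, 6))   # 0.053957

# The value agrees with the psql output to the digits it displays.
assert math.isclose(float_result, 0.0539568345323741, rel_tol=1e-12)
```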

Introducing "tasks"

What we do here is define a separate task that performs the sums. We then use the result of that task as input for a query that does the actual division. Let's see it in practice.

  • Remove the EXECUTE part from test.yml. Specifically, these lines:
EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM
  • Define a task, which is responsible for computing the sums of the x and y values. To do that, it reuses the existing map function. Append this to the DEFINE section of test.yml:
    - TASK:
        NAME: sums
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM

The useful characteristic of tasks is that they can be used as input for further processing stages.

  • Define the step that actually performs the division. It is an SQL SELECT that uses the task defined earlier as its input. Append this to the DEFINE section of test.yml:
    - INPUT:
        NAME: division
        QUERY: |
            SELECT
                (SELECT value FROM sums where key = 'Sum of x') /
                (SELECT value FROM sums where key = 'Sum of y')
                AS final_division;

As you can see, the FROM clause contains the name of the task defined above: sums.

  • Finally, execute the job and display the output. Append this to test.yml:
EXECUTE:
    - RUN:
        SOURCE: division
        TARGET: STDOUT

This step runs the division query and displays the result on standard output.
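The division query in the steps above treats the task output as an ordinary table with key and value columns. As a rough local check of the SQL itself, the sketch below runs the same SELECT against an in-memory SQLite table named sums. SQLite is only a stand-in here (an assumption made for illustration, not part of Greenplum), and the two rows are copied from the output of the summing job.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Greenplum.
conn = sqlite3.connect(":memory:")

# Mimic the output of the "sums" task as a plain table.
conn.execute("CREATE TABLE sums (key TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sums VALUES (?, ?)",
    [("Sum of x", 15.0), ("Sum of y", 278.0)],
)

# Same query as the "division" INPUT in test.yml.
query = """
SELECT
    (SELECT value FROM sums WHERE key = 'Sum of x') /
    (SELECT value FROM sums WHERE key = 'Sum of y')
    AS final_division
"""
final_division = conn.execute(query).fetchone()[0]
conn.close()

print(round(final_division, 6))  # 0.053957
```

Because value is a floating-point column, no explicit cast is needed in the division.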

Put everything together

This is the complete test.yml file:

%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM my_data

    - MAP:

        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [ x integer , y float ]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }
    - TASK:
        NAME: sums
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM

    - INPUT:
        NAME: division
        QUERY: |
            SELECT
                (SELECT value FROM sums where key = 'Sum of x') /
                (SELECT value FROM sums where key = 'Sum of y')
                AS final_division;

EXECUTE:
    - RUN:
        SOURCE: division
        TARGET: STDOUT

Execute the whole job with:

$ gpmapreduce -f test.yml
mapreduce_2235_run_1
    final_division
0.0539568345323741
(1 row)

Compare it with the calculator result. Ok, it matches.

Conclusion

The task is complete. We have calculated sum(x)/sum(y) correctly.

The power of MapReduce lies mainly in the number of servers involved in the calculation.

Many servers each perform a small part of the calculation, and the partial results are combined into the final one. You may not notice the power of MapReduce in such a small example, but this is a good starting point.
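The idea of many servers each doing a small piece of work can be sketched in plain Python. The example below splits a handful of hypothetical rows across three pretend "segments", computes partial sums on each, and then combines them. The rows, the segment count and the round-robin split are assumptions made purely for illustration; they do not describe how Greenplum actually distributes data.

```python
# Hypothetical rows standing in for my_data (invented for illustration).
rows = [(1, 50.0), (2, 60.0), (3, 55.0), (4, 53.0), (5, 60.0)]

# Pretend cluster of three segments; rows are dealt out round-robin.
NUM_SEGMENTS = 3
segments = [rows[i::NUM_SEGMENTS] for i in range(NUM_SEGMENTS)]


def partial_sums(segment_rows):
    # Work done locally by one segment: sum its own x and y values.
    return (
        sum(x for x, _ in segment_rows),
        sum(y for _, y in segment_rows),
    )


# Each segment produces a small partial result...
partials = [partial_sums(segment) for segment in segments]

# ...and the partial results are combined into the final totals.
total_x = sum(px for px, _ in partials)
total_y = sum(py for _, py in partials)

print(total_x, total_y, round(total_x / total_y, 6))
```

Since addition is associative, the combined totals equal the sums computed over the whole table at once, which is what makes this kind of work easy to spread across machines.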
