<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>2ndQuadrant, Professional PostgreSQL</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/" />
    <link rel="self" type="application/atom+xml" href="http://blog.2ndquadrant.com/en/atom.xml" />
    <id>tag:blog.2ndquadrant.com,2009-06-22:/en//3</id>
    <updated>2012-04-13T09:05:25Z</updated>
    <subtitle>2ndQuadrant Ltd official blog</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Open Source 4.12</generator>

<entry>
    <title>External web tables in Greenplum</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2012/04/greenplums-external-web-tables.html" />
    <id>tag:blog.2ndquadrant.com,2012:/en//3.194</id>

    <published>2012-04-10T10:12:00Z</published>
    <updated>2012-04-13T09:05:25Z</updated>

    <summary>External web tables are one of the most useful features when you have to load data into a Greenplum database from different sources....</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="externaltables" label="External tables" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="greenplum42" label="Greenplum 4.2" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p><strong>External web tables</strong> are one of the most useful features when you have to load
data into a Greenplum database from different sources.</p>
]]>
        <![CDATA[<h2>What is an external table?</h2>

<p>External web tables are a special type of external table.</p>

<p>The keyword <em>web</em> means that they can access dynamic data and present it to you as if it were stored in a regular database table.</p>

<p>Because the data can change during every single query execution involving the web table, the Greenplum planner must avoid choosing plans that rescan the whole table.</p>

<h2>How to use them</h2>

<p>There are two different types of web tables: URL-based and OS command-based.</p>

<p>With the URL-based type you can create an external web table that loads data from
files on a web server. The data is transferred over HTTP.</p>
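
<p>As a minimal sketch of the URL-based type, the definition could look like the following. The table name, the columns and the URL are all hypothetical; the file is assumed to be a CSV file with a header line.</p>

<pre><code>-- Hypothetical table and URL, for illustration only.
CREATE EXTERNAL WEB TABLE ext_sales ( name TEXT, amount NUMERIC )
    LOCATION ('http://www.example.com/data/sales.csv')
    FORMAT 'CSV' (HEADER);
</code></pre>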

<p>With the OS command mode you can specify a shell script (or a command) to execute on any number of your cluster's segments. The output of your script will be the data of the web table at the time of access. The command is executed in parallel on all segments by default, but you can restrict it, for instance to the master node only.</p>
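
<p>To get a feeling for the default behaviour, here is a small sketch that assumes nothing more than the standard <code>hostname</code> command being available on every segment host. The table name is hypothetical. Since no <code>ON</code> clause is given, every segment runs the command and contributes its own output rows.</p>

<pre><code>-- Hypothetical table name; each segment returns the name of its host.
CREATE EXTERNAL WEB TABLE segment_hosts ( host TEXT )
    EXECUTE 'hostname'
    FORMAT 'TEXT';

-- One row per segment, grouped by the host it runs on.
SELECT host, count(*) AS segments
  FROM segment_hosts
 GROUP BY host;
</code></pre>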

<p>In this article I have used the OS command mode.</p>

<h2>A practical scenario</h2>

<p>Say we want to load PostgreSQL data into Greenplum using web tables.</p>

<p>First of all, we need to write a script that outputs the desired data.
To achieve that, we will use the power of the COPY command.
An example script would be:</p>

<pre><code>#!/bin/bash

COMMAND="COPY table_name(field) TO STDOUT WITH (FORMAT CSV, DELIMITER '|', HEADER)"

psql -U user -d database -c "$COMMAND"
</code></pre>

<p>This prints the content of the <em>field</em> column of the <em>table_name</em> table to <em>standard output</em> in CSV format, preceded by a header line. In this example we will run the script on the master. In theory, you could copy the script to all segments, each collecting data from partitioned tables in parallel.</p>

<p><strong>IMPORTANT:</strong> the script must be executable by the <code>gpadmin</code> user.</p>

<p>Now it is possible to create the corresponding external web table in Greenplum:</p>

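<pre><code>-- The format options must match what the script prints:
-- a header line and '|' as the delimiter.
CREATE EXTERNAL WEB TABLE ext_table ( field TEXT )
    EXECUTE '/path/to/script.sh'
        ON MASTER
    FORMAT 'CSV' (HEADER DELIMITER '|');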
</code></pre>

<p>Now you are ready to materialise the external data in a regular database table.</p>

<p>For that, you can use either an <code>INSERT INTO table SELECT * FROM ext_table</code>
or a <code>CREATE TABLE table AS SELECT * FROM ext_table</code> statement.</p>
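
<p>Spelled out, the two options could look like the sketch below. The target table name <code>local_copy</code> and the choice of distribution key are assumptions made for the example.</p>

<pre><code>-- Option 1: create the target table and fill it in a single statement.
CREATE TABLE local_copy AS
    SELECT * FROM ext_table
    DISTRIBUTED BY (field);

-- Option 2: append to a table that already exists.
INSERT INTO local_copy
    SELECT * FROM ext_table;
</code></pre>

<p>Every execution of either statement runs the script again, so the data loaded is the data available at that moment.</p>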

<h2>Conclusions</h2>

<p>As you can imagine, this is an extremely powerful feature.
You can write a script to connect to any other DBMS and dump data, or
produce live data with any programming language (for instance from RSS, Atom or other XML feeds).</p>

<p>Feeding tables with OS scripts is one of my favourite features,
and I think the only limitation you have here is your imagination.</p>
]]>
    </content>
</entry>

<entry>
    <title>Intel SSDs:  Lifetime and the 320 vs. 710 Series</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2012/03/intel-ssds-lifetime-and-the-32.html" />
    <id>tag:blog.2ndquadrant.com,2012:/en//3.193</id>

    <published>2012-03-23T22:03:25Z</published>
    <updated>2012-03-23T22:47:19Z</updated>

    <summary><![CDATA[ This week I've been digging deep into PostgreSQL storage hardware again. &nbsp;Since I'm giving a conference talk on database storage in Austin and in the DC area next week, it seems like a good time for me to actually...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="United States News" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresqlssd" label="postgresql ssd" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[ <div>This week I've been digging deep into PostgreSQL storage hardware again. &nbsp;Since I'm giving a conference talk on database storage in <a href="http://pgday.austinpug.org/talks/">Austin</a> and in the <a href="http://pgday.bwpug.org/">DC area</a> next week, it seems like a good time for me to actually know the material. &nbsp;One of the most common questions here is "what's the cheapest SSD I can put my database on?", with the implied hope "...without <a href="http://wiki.postgresql.org/wiki/Reliable_Writes">losing it all the time</a>".  Last year the first inexpensive answer to that appeared on the market, and I suggested people <a href="http://blog.2ndquadrant.com/en/2011/04/intel-ssd-now-off-the-sherr-sh.html">take a look</a> at Intel's 320 series drives. &nbsp;With 217 days of runtime on my first 320 drive here, and Intel's 3rd generation storage line filled out with the more enterprise oriented 710 Series now, it's worth reviewing how that turned out.</div><div><br /></div><div>It wasn't long after the 320 series drives were introduced that people started reporting a firmware problem with the drive, where it did things like report a capacity of 8MB after a restart along with "BAD_CTX 0000013x" errors. &nbsp;A <a href="http://communities.intel.com/thread/24205?tstart=20">firmware update to fix that</a> was released. &nbsp;There are still some claims of <a href="http://communities.intel.com/thread/24339?tstart=0">continued problems</a> floating around. &nbsp;You have to expect some percentage of any product is going to be bad, and the later production runs of this drive (after the big bug was fixed) don't seem above the usual risk level for hard drives to me. 
&nbsp;With the warranty here <a href="http://newsroom.intel.com/community/intel_newsroom/blog/2011/05/19/chip-shot-new-5-year-limited-warranty-on-intel-ssd-320">extended to 5 years</a> (unless you're using it at 'enterprise usage levels'), I think that Intel would be getting killed if the reliability on these was as bad as some people claim.</div><div><br /></div><div>The reason behind the usage level caveat is the main thing worth talking about here. &nbsp;The <a href="http://www.intel.com/support/ssdc/hpssd/sb/CS-032510.htm">long version of the warranty</a> suggests "The media wear-out indicator reports a normalized value of 100 (when the <span class="caps">SSD </span>is brand new out of the factory) and declines to a minimum value of 1. When the value reads 1, this indicates that the <span class="caps">SSD </span>is reaching the wear-out limit". &nbsp;Here's what my first 320 looks like so far:</div>
<pre><code>
[root@toy ~]# smartctl -a /dev/sdc
=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSA2CW120G3
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5225
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       34
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       201389
226 Load-in_Time            0x0032   100   100   000    Old_age   Always       -       2687040
227 Torq-amp_Count          0x0032   100   100   000    Old_age   Always       -       0
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       314526
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       201389
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       133021
</code></pre>
Not all of these attributes are labeled correctly, and some are "Unknown"; all the gory details are in the <a href="http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-specification.html">product specifications</a>.  You can see the Media Wearout above; here are the raw values for other interesting ones, formatted so they're more blog-friendly:
<pre><code>
ID# ATTRIBUTE_NAME          RAW_VALUE
  9 Power_On_Hours          5225
 12 Power_Cycle_Count       58
192 Power-Off_Retract_Count 34
225 Load_Cycle_Count        201389
241 Total_LBAs_Written      201389
242 Total_LBAs_Read         133021
</code></pre>

192 (hex C0) is "Power-Off Retract Count". &nbsp;That's how many unsafe shutdowns the drive has been through, which are the situations where the battery backed cache in the drive has been triggered. &nbsp;With 34 of them here, you can see I've tried to get this drive to die that way.<div><br /></div><div>The first interesting wear figure is 225 (hex code E1) which Intel's documentation describes as "Host Writes". &nbsp;The units for that are 32MB. &nbsp;If you look carefully, you'll see that's the same value given in 241 "Total <span class="caps">LBA</span>s Written". &nbsp;That suggests the <span class="caps">LBA </span>unit for the drive is also 32MB, which I <a href="http://archives.postgresql.org/pgsql-performance/2011-07/msg00194.php"> double-checked</a> last year. &nbsp;At 32MB each, my write value of 201389 means I've written 6.15TB to this drive.</div><div><br /></div><div>Now, computing the true lifetime of an <span class="caps">SSD </span>depends on a couple of magic values, like the "write amplification" of your workload. &nbsp;That suggests how often your workload forces small bits of data out to flash, using up some of the <span class="caps">NAND </span>cell lifetime faster than it might otherwise last. &nbsp;These numbers are really hard to estimate. &nbsp;The most realistic way to figure this out is to run a workload simulation after resetting the drive's internal counters, then see just how much you burned through. 
&nbsp;The process is walked through with an example at <a href="http://www.anandtech.com/show/5518/a-look-at-enterprise-performance-of-intel-ssds/6">"Measuring How Long Your Intel <span class="caps">SSD</span> Will Last"</a>, and it's not too hard to translate that example (which uses Intel's <span class="caps">SSD</span> Toolbox software) into a set of smartctl commands if you're on Linux--the article even uses smartctl for the counter reset part.</div><div><br /></div><div>The official documentation on this is Intel's <a href="http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-enterprise-server-storage-application-specification-addendum.html">Enterprise Server addendum</a>, and here we finally find some hard numbers about the expected life of these drives. &nbsp;My 120GB drive is said to have a "write endurance" of 15TB. &nbsp;A pessimistic look at my sample drive here would check total writes and say that, having written over 6TB, I've gone through 40% of the drive lifetime. &nbsp;But write endurance doesn't work that way; the firmware is constantly doing tricks to extend the life of the drive. &nbsp;Intel's official number they sometimes tie the warranty to, the Media Wearout, is showing 99% left! &nbsp;If that's true--I've only used 1% of the drive's lifespan--then I might manage 600TB of writes before this one really dies on me.</div><div><br /></div><div>So what's the story with the true Enterprise lifetime 710 Series drives? &nbsp;Those are almost the same drives as the 320 series ones, with three significant changes. &nbsp;First, they're said to use higher quality flash, probably with the same sort of "put the best tested chips first in the expensive models" approach Intel is said to use on their <span class="caps">CPU </span>production--what's sometimes called binning. &nbsp;Second, the drives are overprovisioned with a lot more unused flash compared to the 320 series models, and unused flash really helps extend longevity. 
&nbsp;Finally, they don't claim the capabilities to be quite as good. &nbsp;Random write <span class="caps">IOPS </span>numbers on the 710 series drives are lower; my 120GB 320 series drive is specified at 14K write <span class="caps">IOPS, </span>while the 100GB 710 series only aims for 2700.  The drive doesn't claim to support lots of tiny writes and still last for years, which means it's aimed at a different set of write amplification expectations. &nbsp;Similarly, the 710 series drives don't refresh the stored cells in the same way. &nbsp;The downside there is that 710 models are only specified to retain their data for 3 months. &nbsp;That's probably fine for data center use, but that wouldn't be very acceptable to the more consumer oriented market the 320 series is sold to.</div><div><br /></div><div>The end result of that, and how the 710 compares to the 320 series drives, is nicely summarized in the "Write Endurance" table in the <a href="http://www.tomshardware.com/reviews/ssd-710-enterprise-x25-e,3038-2.html">Tom's Hardware Review</a>. &nbsp;Instead of the 15TB endurance number my 320 drive specifies, the similar 100GB 710 series model aims for 500TB. &nbsp;That's just over 30X as long. &nbsp;In the real world, there may not be that big of a difference, as shown by the projected 600TB figure I'm seeing out of my 320 drive so far. &nbsp;But Intel's aiming at conservative engineering lifetimes on the specification sheets, and by that measure the storage cells in the 710 <strong>will</strong> last longer; the 320 models only <strong>may</strong> last longer. &nbsp;And an expected lifetime 30X as long is something some people are surely willing to pay the 710's price premium for.</div><div><br /></div> ]]>
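        <![CDATA[<div>For anyone who wants to redo the arithmetic behind the figures quoted above, here is a rough sketch as a single query. It assumes the 32MB unit means 32 * 1024 * 1024 bytes and that terabytes are counted in powers of 1024, which is an interpretation rather than something the specification states here. The raw value and the 15TB rating are the ones from this article.</div>
<pre><code>
-- 201389 is the raw value of SMART attribute 225/241 shown earlier.
SELECT round(201389 * 32 / 1024.0 / 1024.0, 2)              AS tb_written,
       -- share of the 15TB "write endurance" rating already written
       round(100 * (201389 * 32 / 1024.0 / 1024.0) / 15, 1) AS pct_of_rating,
       -- projection if those writes really cost only 1% of the wear budget
       round(201389 * 32 / 1024.0 / 1024.0 / 0.01, 0)       AS projected_tb
</code></pre>
<div>The results should land close to the 6.15TB, 40% and 600TB numbers used in the text.</div> ]]>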
        
    </content>
</entry>

<entry>
    <title>Using the PostgreSQL System Columns</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2012/01/using-the-postgresql-system-co.html" />
    <id>tag:blog.2ndquadrant.com,2012:/en//3.191</id>

    <published>2012-01-31T21:53:43Z</published>
    <updated>2012-01-31T22:31:18Z</updated>

    <summary><![CDATA[ There are a few parts of the PostgreSQL internals that poke out usefully if you look in the right place for them.&nbsp; One useful set to know about are the System Columns, which you can explicitly request but don't...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresql" label="postgresql" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[


	
	
	
There are a few parts of the PostgreSQL internals that poke out usefully if you look in the right place for them.&nbsp; One useful set to know about is the <a href="http://www.postgresql.org/docs/current/static/ddl-system-columns.html">System Columns</a>, which you can explicitly request but don't see by default.&nbsp; For example:<br /><br /><blockquote>psql -x -c "SELECT oid,* FROM pg_class LIMIT 1"<br /></blockquote>There is no column named oid in the pg_class table, but it's there if you ask for it.&nbsp; The oid used to be relied on more heavily in PostgreSQL as a way to identify rows.&nbsp; That's not true for regular tables anymore, and you really don't want to start doing that for your own tables.&nbsp; OIDs are mainly useful now when joining parts of the <a href="http://www.postgresql.org/docs/current/static/catalogs.html">System Catalog</a> together.&nbsp; A good example is the <a href="http://wiki.postgresql.org/wiki/Disk_Usage">Disk Usage</a> query.&nbsp; If you want to find the namespace a table is in, you need to know you can ask for its OID.&nbsp; It's possible to get some of this data out of more portable views like information_schema.tables.&nbsp; But many of the useful things in this area are PostgreSQL specific.&nbsp; Sometimes I see people starting with the information_schema views and joining against other tables using its text name fields, such as the listed table_name.&nbsp; That approach has several edge cases that don't work out correctly; not handling <a href="http://www.postgresql.org/docs/current/static/storage-toast.html">TOAST</a> columns is a common example.&nbsp; That makes them more prone to breaking on you later, probably after your system has gone into production, than an OID based join.<br /><br />There is also a tableoid system column.&nbsp; As described in the documentation, its main use case is identifying which partition a row comes from.&nbsp; That's not a great thing to be driving application logic from, but it 
can be useful for monitoring or troubleshooting purposes.&nbsp; For example, if you SELECT rows from the parent table in a partitioning inheritance scheme, it's normally expected that no rows will actually be stored there.&nbsp; Checking the tableoid is one way to confirm that.&nbsp; You might confirm that your INSERT/UPDATE trigger is moving rows to the right place using tableoid as well.&nbsp; It's possible to do that for each individual partition section, but running a query against the parent will make sure you hit every row in the table.<br /><br />Another internal column related to uniquely identifying rows is the ctid.&nbsp; The ctid is a direct pointer to the physical block (using PostgreSQL's 8K page size) and position of a row.&nbsp; ctids are a pair of numbers, and the first row will be (0,1).&nbsp; While this is the fastest way to find a row more than once in the same transaction block, these numbers are not stable in the long term.&nbsp; Any UPDATE and some maintenance operations will change them.&nbsp; One thing you can use these for is finding duplicate data in a table.&nbsp; Let's say you're trying to add a unique constraint, but one row in the table is duplicated 3 times, which blocks the unique index from being created.&nbsp; When rows are identical in every column, you can't write any simple SELECT statement to uniquely identify them.&nbsp; That means deleting all of them but one copy requires some annoying and fragile SQL code, combining DELETE with LIMIT and/or OFFSET--which is always scary.&nbsp; If you use the ctid instead, the implementation will be PostgreSQL specific, but it will also be faster and cleaner.&nbsp; See <a href="http://www.postgresonline.com/journal/archives/22-Deleting-Duplicate-Records-in-a-Table.html">Deleting Duplicate Records in a Table</a> for an example of how that can be done.<br /><br />The other system columns all relate to transaction visibility:&nbsp; xmin, cmin, xmax, cmax.&nbsp; When you delete a row in 
PostgreSQL, it isn't eliminated from disk immediately.&nbsp; It's possible that some other query that's executing at the same time will still need to see that row, and the <a href="http://www.postgresql.org/docs/current/static/transaction-iso.html">transaction isolation</a> in PostgreSQL worries about such things.&nbsp; If you ever want to learn how that isolation works, the way the <a href="http://www.postgresql.org/docs/current/static/mvcc.html">Multiversion Concurrency Control</a> (MVCC) implementation is handled, you can watch parts of it happen.&nbsp; Just open transactions in two different sessions, UPDATE/DELETE in one of them, and then look at those rows in the other.&nbsp; You can still see them in the session where they weren't touched, but they'll be marked to expire in the future via their xmax being set.&nbsp; To really pull that all together, you also need to know about some of the <a href="http://www.postgresql.org/docs/current/static/functions-info.html">System Information Functions</a>.&nbsp; <i>txid_current()</i> is the most useful for this sort of learning experience; it provides a reference point for the always increasing system transaction ID.&nbsp; You can find a more detailed exploration of using these functions and system columns in Bruce's <a href="http://momjian.us/main/presentations/internals.html">MVCC Unmasked</a> talk.&nbsp; The "Routine Maintenance" chapter of <a href="https://www.packtpub.com/postgresql-90-high-performance/book">my book</a> also shows examples of how MVCC works through the perspective of the system columns.<br />]]>
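        <![CDATA[<br />To make the system columns above a little more concrete, here is a small sketch you can paste into a scratch session.&nbsp; The table <i>dup_demo</i> and its contents are hypothetical, made up only for this illustration, and the duplicate removal is one possible ctid based approach, not necessarily the one used in the linked article.<br />
<pre><code>
-- Hypothetical throwaway table with one row duplicated 3 times.
CREATE TEMP TABLE dup_demo (id integer, payload text);
INSERT INTO dup_demo VALUES (1, 'a'), (1, 'a'), (1, 'a'), (2, 'b');

-- System columns only show up when you ask for them by name.
SELECT tableoid::regclass AS source_table, ctid, xmin, xmax, *
  FROM dup_demo;

-- Delete every row that has an identical twin at a lower ctid,
-- which leaves exactly one copy of each duplicated row.
DELETE FROM dup_demo a
 USING dup_demo b
 WHERE a.ctid &gt; b.ctid
   AND a.id = b.id
   AND a.payload = b.payload;

-- Reference point to compare against the xmin/xmax values seen above.
SELECT txid_current();
</code></pre>
Note that the equality tests in the DELETE never match NULL values, so a table with NULLs in the compared columns would need <i>IS NOT DISTINCT FROM</i> instead.<br />]]>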
        
    </content>
</entry>

<entry>
    <title>Setting JDBC with Greenplum</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2012/01/setting-greenplum-and-jdbc.html" />
    <id>tag:blog.2ndquadrant.com,2012:/en//3.190</id>

    <published>2012-01-19T12:16:59Z</published>
    <updated>2012-01-23T11:26:35Z</updated>

    <summary>JDBC is the standard API used to access a database from Java. Greenplum has a fully working JDBC driver. In this short article we&apos;ll see how to use it....</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplumcommunityedition" label="Greenplum Community Edition" scheme="http://www.sixapart.com/ns/types#tag" />
    <category term="jdbc" label="JDBC" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>JDBC is the standard API used to access a database from Java. Greenplum has a fully working JDBC driver.
In this short article we'll see how to use it.</p>
]]>
        <![CDATA[<h2>Download and install</h2>

<p>You can download the JDBC driver for Greenplum directly from the Greenplum Community Edition site (http://www.greenplum.com/community/downloads/database-ce/).
Look for the <em>"Connectivity Tools"</em> file.</p>

<p>You will receive a link to download the archive file.
Extract the archive and run the extracted binary. Then follow the on-screen instructions, and in less than a minute the JDBC driver is installed.</p>

<h2>Prepare the Greenplum server</h2>

<p>After a successful installation, make sure that the server accepts TCP connections from the desired hosts. Check that <em>listen_addresses</em> is properly set in postgresql.conf.</p>

<p><strong>Note:</strong> by default, Greenplum listens on all addresses.</p>

<p>Another aspect you have to consider is user authentication, which is controlled by the pg_hba.conf file (please refer to page 36 of the Greenplum AdminGuide for more information).</p>

<p>After you have verified the user is able to connect to the database, you can go on and test JDBC.</p>

<p>Connecting to a Greenplum Database with JDBC is a three-step procedure:</p>

<ul>
<li>Import JDBC</li>
<li>Load the driver</li>
<li>Connect to the database</li>
</ul>

<p>To import JDBC, add this line to the top of your Java source:</p>

<pre><code>import java.sql.*;
</code></pre>

<p>To load the driver, use this line:</p>

<pre><code>Class.forName("org.postgresql.Driver");
</code></pre>

<p>Remember that the <em>forName</em> method can throw a <em>ClassNotFoundException</em> if the driver is not available. The simple example below does not catch that exception, but you should in your production environment.</p>

<p>To connect to a database using JDBC, you have to use a connection URL. It can be in one of these three forms:</p>

<ul>
<li>jdbc:postgresql:databasename</li>
<li>jdbc:postgresql://host/databasename</li>
<li>jdbc:postgresql://host:port/databasename</li>
</ul>

<p>Build a URL that suits your needs and pass it to the <code>getConnection</code> method. For example:</p>

<pre><code>Connection db = DriverManager.getConnection(url, username, password);
</code></pre>

<h2>Issuing a query</h2>

<p>To perform queries on the database, you have to use a Statement or a PreparedStatement instance.
You can create a Statement object using the createStatement method of the Connection class.
The result of a query execution is a ResultSet object, containing the entire result.</p>

<p>A ResultSet object can be iterated with the usual next() method, as shown in the example below.</p>

<h2>A simple example</h2>

<pre><code>import java.sql.*;

public class SimpleQuery {

    // Exceptions are declared rather than caught to keep the example short.
    public static void main(String[] args) throws ClassNotFoundException, SQLException {

        Class.forName("org.postgresql.Driver");
        Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/testdb", "username", "passw0rd");

        Statement st = db.createStatement();
        ResultSet rs = st.executeQuery("SELECT * FROM mytable WHERE columnfoo = 500");
        while (rs.next()) {
            System.out.print("Column 1 returned ");
            System.out.println(rs.getString(1));
        }
        // Release the resources in reverse order of creation.
        rs.close();
        st.close();
        db.close();
    }
}
</code></pre>

<p>It is very important to understand how to bind values in queries, in order to prevent SQL injection. The following snippet shows how to do that:</p>

<pre><code>// db is the Connection opened as in the previous example
int foovalue = 500;
PreparedStatement st = db.prepareStatement("SELECT * FROM mytable WHERE columnfoo = ?");
// Bind the value to the first (and only) placeholder
st.setInt(1, foovalue);
ResultSet rs = st.executeQuery();
while (rs.next()) {
    System.out.print("Column 1 returned ");
    System.out.println(rs.getString(1));
}
rs.close();
st.close();
</code></pre>
]]>
    </content>
</entry>

<entry>
    <title>Greenplum 4.2 is out!</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2012/01/greenplum-42-is-out.html" />
    <id>tag:blog.2ndquadrant.com,2012:/en//3.189</id>

    <published>2012-01-02T08:39:34Z</published>
    <updated>2012-01-16T09:39:36Z</updated>

    <summary>With an announcement on the forum, Greenplum staff have spoken about the new version of their Database Management System. I can&apos;t resist blogging about some of its new features....</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplum42" label="Greenplum 4.2" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>With an announcement on the forum, Greenplum staff have spoken about the new version of their Database Management System.
I can't resist blogging about some of its new features.</p>
]]>
        <![CDATA[<h2>Cool new features</h2>

<p>You can find a <a href="http://www.greenplum.com/community/forums/showthread.php?565-Announcing-Greenplum-Databse-4.2">detailed summary of the new features on Greenplum's website</a>.</p>

<p>The first in the list is the Greenplum <strong>Database Extension Framework</strong>.</p>

<p>You can think of it as similar to the <em>EXTENSION</em> feature of Postgres 9.1 (even though the implementations have nothing in common).
It is an amazing way to simplify installing, uninstalling and upgrading extensions.
I think that this feature will bring more <em>logic</em> into the database.</p>

<p>Another interesting improvement is the <strong>High-Performance gNet for Hadoop</strong>.</p>

<p>Everybody knows <em>Hadoop</em>, and even if you don't know it in detail, you have at least heard about it.
With this feature, you can import/export data from/to Hadoop clusters very quickly, using the gNet protocol.
I advise you to try it; it is included in the Community Edition!</p>

<p>Helpers for <strong>migrations</strong> from other database systems.</p>

<p>Greenplum 4.2 offers some helpers to make a migration from other databases a lot easier.
For example, there are 20 Oracle functions that run natively on Greenplum 4.2.
If you know Postgres, you can think of it as a sort of <code>orafce</code>.
This is a delicate area: I advise you not to underestimate the workload of a migration,
and do not assume that everything can be done automatically.</p>

<p><strong>XML support</strong>.</p>

<p>That's fantastic. I am a Postgres user, administrator and consultant, and I really love Postgres XML support.
Now Greenplum has that feature. I am sure I will blog more about this in the future, with some practical usage.</p>

<p>Targeted <strong>Performance Optimization</strong>.</p>

<p>Greenplum 4.2 has some improvements regarding performance.</p>

<p>The 4.2 version is a lot smarter: it transparently eliminates irrelevant partitions of a table, so that a smaller amount of data needs to be scanned to obtain query results.
Another transparent improvement concerns memory: Greenplum 4.2 has more efficient memory management, which results in better memory utilization and higher concurrency.</p>
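
<p>One way to observe partition elimination yourself is to compare query plans. The sketch below is only an illustration: the <code>sales</code> table, its columns and the date ranges are hypothetical, and the exact plan output depends on your version and data.</p>

<pre><code>-- Hypothetical table, split into one partition per month of 2012.
CREATE TABLE sales ( id INT, sale_date DATE, amount NUMERIC(10,2) )
    DISTRIBUTED BY (id)
    PARTITION BY RANGE (sale_date)
    ( START (date '2012-01-01') INCLUSIVE
      END (date '2013-01-01') EXCLUSIVE
      EVERY (INTERVAL '1 month') );

-- Only the March partition is relevant to this predicate;
-- inspect the plan to see which partitions are actually scanned.
EXPLAIN
SELECT sum(amount)
  FROM sales
 WHERE sale_date &gt;= date '2012-03-01'
   AND sale_date &lt; date '2012-04-01';
</code></pre>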

<p>In the coming weeks I will try some of these features and blog about them, while waiting for the Community Edition to be out!</p>

<p>So stay tuned on the Greenplum Community Forum and on the 2ndQuadrant blog for news about it.</p>
]]>
    </content>
</entry>

<entry>
    <title>How to initialize Greenplum on multiple nodes</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/12/an-handybook-to-init-greenplum.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.188</id>

    <published>2011-12-16T12:00:33Z</published>
    <updated>2011-12-16T18:04:43Z</updated>

    <summary>In the previous article we have seen how to install Greenplum on multiple nodes. After installation steps, we must init the entire system. Let&apos;s see how....</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplumce41initialitazionhandbook" label="Greenplum CE 4.1 Initialitazion handbook" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>In the <a href="http://blog.2ndquadrant.com/en/2011/12/a-greenplum-41-handbook.html">previous article</a> we saw how to install Greenplum on multiple nodes.
After the installation steps, we must initialize the entire system.
Let's see how.</p>
]]>
        <![CDATA[<h2>Current situation</h2>

<p>If you have followed the previous article, you have Greenplum installed on multiple nodes.</p>

<p>The standard procedure when dealing with a Greenplum database, as with Postgres, consists of:</p>

<ul>
<li>Installation</li>
<li>Initialization</li>
<li>Database Start</li>
</ul>

<p>In this article, we will see the second and third steps: initializing and starting the database.</p>

<h2>Database initialization</h2>

<p>The script that does the job here is <code>gpinitsystem</code>.
<code>gpinitsystem</code> needs a special configuration file, which contains a list of segment host addresses <em>only</em>.
Let's name it <code>hostfile_gpinitsystem</code>.</p>

<p>For example:</p>

<pre><code>segment-hostname-1
segment-hostname-2
...
</code></pre>

<p>One more file is needed; its name is <code>gpinitsystem_config</code>. It contains a lot of parameters to configure your system.
An example configuration file, to be used as a template, is in <code>$GPHOME/docs/cli_help/gpconfigs/gpinitsystem_config</code>.
You can copy that example file and modify it to suit your needs.
A detailed list of all parameter meanings is in the Admin Guide, at page 67.</p>

<p>The most important part of the file is:</p>

<pre><code>ARRAY_NAME="EMC Greenplum DW"
PORT_BASE=40000
declare -a DATA_DIRECTORY=(/data1/primary /data1/primary
/data1/primary /data2/primary /data2/primary /data2/primary)
MASTER_HOSTNAME=master-hostname
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
</code></pre>
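
<p>As a rough illustration of how I read these parameters (an assumption based on my understanding of the Admin Guide, not something <code>gpinitsystem</code> prints): each entry in <code>DATA_DIRECTORY</code> yields one primary segment per host, and ports are assigned upwards from <code>PORT_BASE</code>. The helper name <code>primary_segments</code> is hypothetical; this is a small Python sketch, not part of Greenplum:</p>

<pre><code># Sketch only: mirrors the PORT_BASE and DATA_DIRECTORY values shown above.
PORT_BASE = 40000
DATA_DIRECTORY = [
    "/data1/primary", "/data1/primary", "/data1/primary",
    "/data2/primary", "/data2/primary", "/data2/primary",
]


def primary_segments(port_base, data_dirs):
    """Return one (port, directory) pair per primary segment on a host."""
    return [(port_base + i, d) for i, d in enumerate(data_dirs)]


segments = primary_segments(PORT_BASE, DATA_DIRECTORY)
for port, directory in segments:
    print(port, directory)
</code></pre>

<p>With the values above, the sketch lists six primary segments per host, three in each data directory, on ports 40000 to 40005.</p>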

<p>Now you can run the <code>gpinitsystem</code> utility with those two files as parameters, this way:</p>

<pre><code>$ gpinitsystem -c gpinitsystem_config -h hostfile_gpinitsystem
</code></pre>

<p>After a successful initialization you will see a friendly message:</p>

<pre><code>Greenplum Database instance successfully created.
</code></pre>

<p>In case of failure, you may end up with a partially installed system.</p>

<p>Because of that, after a failure in <code>gpinitsystem</code> Greenplum automatically creates a "rollback" script
that can be executed to clean things up.</p>

<p>An example cleanup script file would be named:</p>

<pre><code>backout_gpinitsystem_gpadmin_20111216_121053
</code></pre>

<p>You can execute it from the shell:</p>

<pre><code>$ sh backout_gpinitsystem_gpadmin_20111216_121053
</code></pre>

<p>All partially installed Greenplum files will be removed.</p>

<p>That's all for now.
I hope that this article, together with the
<a href="http://blog.2ndquadrant.com/en/2011/12/a-greenplum-41-handbook.html">previous one</a>, will show you
how easy Greenplum installation is.</p>
]]>
    </content>
</entry>

<entry>
    <title>A Greenplum 4.1 installation handbook</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/12/a-greenplum-41-handbook.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.187</id>

    <published>2011-12-05T17:32:53Z</published>
    <updated>2011-12-05T17:39:55Z</updated>

    <summary>One of the main advantages using Greenplum is that it gains power when it uses multiple nodes. Horizontal scalability is a main feature of Greenplum. Here is a compact handbook to install a multi-node Data Warehouse environment with Greenplum....</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplumce41installationhandbook" label="Greenplum CE 4.1 Installation handbook" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>One of the main advantages of using Greenplum is that it gains power when it uses multiple nodes.
Horizontal scalability is a main feature of Greenplum.</p>

<p>Here is a compact handbook to install a multi-node Data Warehouse environment with Greenplum.</p>
]]>
        <![CDATA[<h2>Preparation steps</h2>

<p>This little guide covers Greenplum 4.1 installation.
This is not intended to be a replacement for the official Install Guide, just a little handbook to keep on your desk.</p>

<p>You have to tune your Operating System a little bit before installing Greenplum.
That's a very well documented procedure; I advise you to read it in the Install Guide at page 18.</p>

<h2>Installing Greenplum</h2>

<p>First of all, you have to run the Greenplum installer script on the master host, as <code>root</code>.
The installer script can be downloaded from the Greenplum community site: http://www.greenplum.com/community/downloads/database-ce</p>

<p>Make sure to download the correct version!</p>

<p>The installer script asks a few questions and displays the license; simply follow the on-screen instructions.</p>

<p>Now comes the important part: you have to run a special script that sets up Greenplum on a list of hosts for you. Awesome!
It simply copies the Greenplum installation from the current host to a list of specified hosts (it also takes care of exchanging SSH keys and creating the <code>gpadmin</code> user).</p>

<p><em>Specified where?</em></p>

<p>The important file here is <code>hostfile_exkeys</code>; it must contain the hostname of each host in your Greenplum system. For example:</p>

<pre><code>master-hostname
master-segment-hostname
segment-hostname-1
segment-hostname-2
...
</code></pre>

<p>This is enough to run <code>gpseginstall</code>, in this way:</p>

<pre><code># gpseginstall -f hostfile_exkeys -u gpadmin -p yourpassword
</code></pre>

<h2>Creating directories</h2>

<p>It's time to create the <code>master</code> directory on the master host.
Remember that the real data is on the segments, so not much space is needed here.
For example:</p>

<pre><code># mkdir /data/master
# chown gpadmin /data/master
</code></pre>

<p>You have to create that directory on your master segment as well.
Greenplum provides a useful script to do the job; it is called <code>gpssh</code>:</p>

<pre><code># gpssh -h master-segment-hostname -e 'mkdir /data/master'
# gpssh -h master-segment-hostname -e 'chown gpadmin /data/mast
</code></pre>

<p>Finally, you have to create data directories on all segment hosts, and you can do that
all at once, thanks to <code>gpssh</code>.</p>

<p>Remember that real data goes there, so a lot of space is needed.</p>

<p>Create a file called <code>hostfile_gpssh_segonly</code> and place <em>only</em> segment hostnames in it. For example:</p>

<pre><code>segment-hostname-1
segment-hostname-2
</code></pre>

<p>Now, run commands on all segments at once like this:</p>

<pre><code># gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/primary'
# gpssh -f hostfile_gpssh_segonly -e 'mkdir /data/mirror'
# gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/primary'
# gpssh -f hostfile_gpssh_segonly -e 'chown gpadmin /data/mirror'
</code></pre>

<h2>Conclusions</h2>

<p>Here's a list of steps to keep on your desk, I hope you will find it useful:</p>

<ul>
<li>Configure your Operating System for Greenplum (as written in the Install Guide)</li>
<li>Install Greenplum on the master host</li>
<li>Run <code>gpseginstall</code> to install Greenplum on the other hosts</li>
<li>Create the master directory on the master</li>
<li>Create the same directory on the master segment (<code>gpssh</code> can help here)</li>
<li>Create data directories on the segments (<code>gpssh</code> can help here)</li>
</ul>

<p>In the next article, we will see how to init and start the Greenplum Database we have just installed.
Stay tuned.</p>

<p>Cheers</p>
]]>
    </content>
</entry>

<entry>
    <title>PGDay.IT 2011 was &quot;bellissimo&quot;!</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/11/pgdayit-2011-was-bellissimo.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.186</id>

    <published>2011-11-28T12:26:39Z</published>
    <updated>2011-11-28T12:51:07Z</updated>

    <summary>The fifth edition of the Italian PGDay went well beyond our initial expectations. We had about 75 participants, a total of 95 people including staff and speakers.As I said during the event, rather than PGDay Italy, this should be named...</summary>
    <author>
        <name>Gabriele Bartolini</name>
        <uri>http://www.2ndquadrant.it/chi-siamo/#bartolini</uri>
    </author>
    
        <category term="Gabriele&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="pgdayit2011" label="PGDay.IT 2011" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[The fifth edition of the Italian PGDay went well beyond our initial expectations. We had about 75 participants, a total of 95 people including staff and speakers.<br />As I said during the event, rather than PGDay Italy, this should be named PGDay for Italian speakers given the presence of staff from Switzerland (Canton Ticino). Participants came from 12 regions: all regions but Val d'Aosta in the north/centre area, but also from Southern Italy (Naples and Calabria).<br />]]>
        <![CDATA[In any case, it was fantastic to get back to Prato after 3 years, in the
 Monash&nbsp; University Prato Centre. The atmosphere was very similar to the
 first editions and we are really happy to see the community grow (last 
year we had 60 participants in total, which means an increase of about 60%).<br />
The quality of the talks was also, in my humble opinion, very high.<br />
The audience paid a tribute to Magnus Hagander for his nomination in the
 core team, then Magnus wonderfully covered the new features of 
PostgreSQL 9.1.<br /><span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/2011/11/28/pgday-it.html" onclick="window.open('http://blog.2ndquadrant.com/en/2011/11/28/pgday-it.html','popup','width=2592,height=1936,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/2011/11/28/pgday-it-thumb-500x373.jpg" alt="Magnus Hagander's keynote at PGDay.IT 2011" class="mt-image-right" style="float: right; margin: 0 0 20px 20px;" height="373" width="500" /></a></span><br />
Simon Riggs' talks were on the future of PostgreSQL and on NoSQL 
databases. Very interesting. Andreas Scherbaum covered data warehouse 
topics.<br />
Other interesting talks were:<br />
<ul><li>&nbsp;the experience on open source and Postgres for CSI Piemonte, one of the main organisations working for local governments</li><li>ORM and Perl by Ferruccio Zamuner</li><li>Node.js and Postgres by Lucio Granzi<br />
  </li><li>Serialisable snapshot isolation, covered by Marco Nenciarini</li><li>repmgr by Carlo Ascani</li><li>foreign tables and data wrappers by Giulio Calacoci</li></ul>
<p><br /></p><p>Unfortunately I missed Gianni's talk on debugging with CTEs as I was 
giving my speech on the project we (as 2ndQuadrant Italia) have been 
working on in recent months: <a href="http://www.pgbarman.org/">BaRMan, backup and recovery manager for PostgreSQL</a>.
 The feedback I received was excellent (especially from certified Oracle 
engineers). We hope we can release it by the end of 2011 (depending on 
sponsorships) or, at the latest, early 2012.</p>
<p>In any case, even though organising this kind of event as a community 
is not an easy task, the success of this edition is a source of 
motivation for all of us. We hope we can start much earlier with the 
organisation and we hope we can find more sponsors/partners for next 
year.</p>
<p>I take this opportunity to thank all the speakers who took part in 
PGDay and all the staff members: Diego, Luca, Gianluca, Cosimo, 
Maurizio, Marco (Tofanari), Emanuele and the team from 2ndQuadrant 
Italia (Carlo, Giulio, Marco, Gianni and Simone).<br />
</p>
<div><br /></div>]]>
    </content>
</entry>

<entry>
    <title>Using Greenplum 4.1 in Ubuntu 11.10</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/11/greenplum-41-on-ubuntu-server.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.185</id>

    <published>2011-11-28T08:36:08Z</published>
    <updated>2011-11-28T11:51:24Z</updated>

    <summary>Greenplum does not officially support Ubuntu Server 11.10 as underlying operating system. However, I needed to install it on the most recent Ubuntu server just to perform some tests and evaluate it....</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplumceonubuntulinux" label="Greenplum CE on Ubuntu Linux" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>Greenplum does not officially support Ubuntu Server 11.10 as an underlying operating system.
However, I needed to install it on the most recent Ubuntu server just to perform some tests and evaluate it.</p>
]]>
        <![CDATA[<h2>Some words on Ubuntu Server</h2>

<p>Ubuntu Server is becoming more popular every day.
Solutions like <em>Canonical Landscape</em> are getting more and more fans.</p>

<p>In my case, I have to perform some tests with Greenplum and I have an Ubuntu Server available.
I am not doing this for a production environment and I do not advise you to use Greenplum on Ubuntu on production systems.
EMC does not officially support Ubuntu, and that should suffice.</p>

<h2>Greenplum on Ubuntu, a quick and dirty approach</h2>

<p>Assuming that you have an Ubuntu 11.10 Server up and running (64 bit version), here are some tips to install Greenplum 4.1.</p>

<p>The Greenplum installation script checks whether you are running a CentOS operating system by looking for the <code>/etc/redhat-release</code> file.
Let's create that file on our Ubuntu Server, specifying a "fake" CentOS release:</p>

<pre><code>echo "RedHat/CentOS" > /etc/redhat-release
</code></pre>

<p>OK, that's a horrible dirty hack, but let's move on.</p>
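
<p>For the curious, a check of this kind can be as simple as testing whether the file exists. The Python sketch below is only my assumption of what such a test looks like, not the actual installer code; <code>looks_like_redhat</code> is a hypothetical name, and a temporary file is used instead of the real <code>/etc/redhat-release</code>:</p>

<pre><code>import os
import tempfile


def looks_like_redhat(release_file="/etc/redhat-release"):
    # Presence-only check: the content of the file is never inspected
    return os.path.isfile(release_file)


# Demonstrate with a temporary file instead of touching /etc
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "redhat-release")
    before = looks_like_redhat(fake)
    with open(fake, "w") as handle:
        handle.write("RedHat/CentOS\n")
    after = looks_like_redhat(fake)

print(before, after)
</code></pre>

<p>A test like this cannot tell a real CentOS system from our Ubuntu Server once the file is in place, which is why the trick works.</p>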

<h2>The actual installation process</h2>

<p>You can follow the <a href="http://bitcast-a.v1.sjc1.bitgravity.com/greenplum/Greenplum_CE_Database/documentation/GP-4111-InstallGuide.pdf">Greenplum 4.1.1 Installation Guide</a>, which you can download from the community site.
I have tested only a single-node cluster.</p>

<h2>Conclusion</h2>

<p>Greenplum does not support Ubuntu Server.
Please do not use Greenplum on Ubuntu Server for production environments or, if you wish to do so, make sure you contact EMC.
However, this article shows how easy it is to install Greenplum on the latest Ubuntu server for evaluation purposes.</p>
]]>
    </content>
</entry>

<entry>
    <title>Mapreduce in Greenplum 4.1 - 2nd part</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/11/mapreduce-in-greenplum-2nd.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.183</id>

    <published>2011-11-17T14:25:09Z</published>
    <updated>2011-11-17T16:22:57Z</updated>

    <summary>Through this article, we are going to complete the MapReduce job started in the previous article....</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplumcommunityedition" label="Greenplum Community Edition" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>Through this article, we are going to complete the MapReduce job started in the <a href="http://blog.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html">previous article</a>.</p>
]]>
        <![CDATA[<h2>Take up the problem from the previous article</h2>

<p>In the <a href="http://blog.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html">previous article</a>, we left with this MapReduce configuration file:</p>

<pre><code>%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM my_data

    - MAP:

        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [ x integer , y float ]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }

EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM
</code></pre>

<p>Which produces the following output:</p>

<pre><code>key     |value
--------+-----
Sum of x|   15
Sum of y|  278
(2 rows)
</code></pre>

<p>In plain words, that job sums all values from two different columns of a test table.</p>
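
<p>To make the data flow explicit, here is a minimal Python sketch that imitates the job outside Greenplum. The rows in <code>my_data</code> are hypothetical (the real table comes from the previous article); I only chose them so that the sums match the output above. <code>reduce_sum</code> and <code>run_job</code> are illustrative names standing in for the built-in <code>SUM</code> reducer and the <code>RUN</code> step:</p>

<pre><code>from collections import defaultdict

# Hypothetical rows standing in for my_data, chosen to match the sums above
my_data = [(1, 50.0), (2, 60.0), (3, 48.0), (4, 70.0), (5, 50.0)]


def my_map_function(x, y):
    # Same body as the MAP function in the configuration file
    yield {'key': 'Sum of x', 'value': x}
    yield {'key': 'Sum of y', 'value': y}


def reduce_sum(pairs):
    # Stand-in for the built-in SUM reducer: add up the values of each key
    totals = defaultdict(float)
    for pair in pairs:
        totals[pair['key']] += pair['value']
    return dict(totals)


def run_job(rows):
    # Apply the map function to every input row, then reduce the emitted pairs
    pairs = []
    for x, y in rows:
        pairs.extend(my_map_function(x, y))
    return reduce_sum(pairs)


sums = run_job(my_data)
print(sums)
</code></pre>

<p>In Greenplum the map calls are spread across the segments; here everything runs in a single process, but the two keys end up with the same totals, 15 and 278.</p>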

<p>Our goal here is to divide these two values, namely <code>15</code> by <code>278</code>.</p>

<p>Let's check what the result is with a calculator, just to be sure that the MapReduce job will return the correct value:</p>

<pre><code>$ psql -c "SELECT 15/278::FLOAT AS result" test_database
       result
 0.0539568345323741
(1 row)
</code></pre>

<p>Yes, we use Greenplum as a calculator :).</p>

<h2>Introducing "tasks"</h2>

<p>What we do here is define a separate task that performs the sums.
We will use the result of that task as input for a query that actually does the division step.
Let's see it in practice.</p>

<ul>
<li>Remove the <code>EXECUTE</code> part from <code>test.yml</code>. In detail, these lines:</li>
</ul>

<pre><code>EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM
</code></pre>

<ul>
<li>Define a task, which is responsible for computing the sums of the <em>x</em> and <em>y</em> values. To do that, it reuses the old map function.
Append this to <code>test.yml</code>:</li>
</ul>

<pre><code>- TASK:
        NAME: sums
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM
</code></pre>

<p>The useful characteristic of tasks is that they can be used as input for further processing stages.</p>

<ul>
<li>Define the step that actually performs the division. It is an SQL SELECT that uses the task defined earlier as input. Append this to <code>test.yml</code>:</li>
</ul>

<pre><code>- INPUT:
        NAME: division
        QUERY: |
            SELECT
                (SELECT value FROM sums where key = 'Sum of x') /
                (SELECT value FROM sums where key = 'Sum of y')
                AS final_division;
</code></pre>

<p>As you can see, the <code>FROM</code> clause contains the name of the task defined above: <code>sums</code>.</p>

<ul>
<li>Finally, execute the job and display the output. Append this to <code>test.yml</code>:</li>
</ul>

<pre><code>EXECUTE:
    - RUN:
        SOURCE: division
        TARGET: STDOUT
</code></pre>

<p>This step runs the <em>division</em> query and displays the result on standard output.</p>
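
<p>The same step can be imitated in plain Python. This is only a sketch: the dictionary stands in for the output of the <code>sums</code> task, using the values shown earlier, and <code>division</code> is a hypothetical helper that mirrors the two subqueries:</p>

<pre><code># Values taken from the output of the "sums" task shown earlier
sums = {'Sum of x': 15.0, 'Sum of y': 278.0}


def division(task_output):
    # Mirrors the two scalar subqueries of the "division" INPUT
    return task_output['Sum of x'] / task_output['Sum of y']


final_division = division(sums)
print(final_division)
</code></pre>

<p>The point is that the task output behaves like a small two-row table that a later query can read by key.</p>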

<h2>Put everything together</h2>

<p>This is the complete <code>test.yml</code> file:</p>

<pre><code>%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM my_data

    - MAP:

        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [ x integer , y float ]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }
    - TASK:
        NAME: sums
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM

    - INPUT:
        NAME: division
        QUERY: |
            SELECT
                (SELECT value FROM sums where key = 'Sum of x') /
                (SELECT value FROM sums where key = 'Sum of y')
                AS final_division;

EXECUTE:
    - RUN:
        SOURCE: division
        TARGET: STDOUT
</code></pre>

<p>Execute the whole job with:</p>

<pre><code>$ gpmapreduce -f test.yml
mapreduce_2235_run_1
    final_division
0.0539568345323741
(1 row)
</code></pre>

<p>Compare it with the calculator result. Ok, it matches.</p>

<h2>Conclusion</h2>

<p>The task is complete. We have calculated <code>sum(x)/sum(y)</code> correctly.</p>

<p>The power of MapReduce is mainly in the number of servers involved in the calculation.</p>

<p>Many servers each perform a small calculation to get the final result.
Maybe you will not notice the power of MapReduce here, but this is a good starting point.</p>
]]>
    </content>
</entry>

<entry>
    <title>Global trends in deploying PostgreSQL</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/11/global-trends-in-deploying-pos.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.184</id>

    <published>2011-11-14T18:08:39Z</published>
    <updated>2011-11-21T01:39:00Z</updated>

    <summary><![CDATA[This year's conference lineup led me all over the world, a giant rectangle triangle going from the west coast of the US, north to Canada, east to the UK and Amsterdam, then ending south in Brazil.&nbsp; I've now locked myself...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresql" label="postgresql" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[This year's conference lineup led me all over the world, a giant rectangle triangle going from the west coast of the US, north to Canada, east to the UK and Amsterdam, then ending south in Brazil.&nbsp; I've now locked myself in to focus on the 3rd CommitFest for PostgreSQL 9.2, which began a few days ago.&nbsp; Check out the 2011-11 section of the <a href="https://commitfest.postgresql.org/">CommitFest tracker</a> to see what changes have been submitted, we're always looking for new volunteer <a href="http://wiki.postgresql.org/wiki/Reviewing_a_Patch">patch reviewers</a>.<br /><br />Talking to people deploying PostgreSQL in several countries during a short span of time has given me some interesting perspective on where the project is at.&nbsp; I follow a lot of adoption trends in the US, and some of those I assume are quirks in how business is done in this country.&nbsp; But when I hear the same sort of feedback from people in all four of the countries I've been to this year, too, it's clear this is a larger issue.<br /><br />The first thing I'm seeing a surprising amount of is satisfaction with the feature set in PostgreSQL.&nbsp; A few years ago, conversations about what you could and couldn't do with PostgreSQL usually stalled on one of a few common requests.&nbsp; There's a good survey of <a href="http://postgresql.uservoice.com/forums/21853-general">PostgreSQL feature feedback</a> at User Voice.&nbsp; 13 important features originally on that list have been closed already, with Index Only scans as the next expected to fall in the upcoming 9.2.&nbsp; PostgreSQL now includes regular and synchronous replication as of 9.1.&nbsp; pg_upgrade has been getting an increasing amount of testing that proves it works for many in-place upgrade scenarios.&nbsp; Extensions are dramatically easier to use now.<br /><br />It seems the total feature set has crossed the threshold where PostgreSQL is good enough for a whole lot more deployment situations than it 
used to be.&nbsp; What I'm hearing from people all over the world now is that the basic feature set and performance of PostgreSQL isn't failing the "checkbox test" so often anymore, where business people require certain things before they'll even consider a database.&nbsp; There are some major wants that are some distance off, such as materialized views and better OLAP support (cube/rollup/etc.)&nbsp; And using partitioning for bigger data sets is harder than people would like.&nbsp; But these are all things that are only needed for larger deployments, and some workarounds exist if you're willing to work at them.<br /><br />If the feature set isn't holding back as many deployments now, what is?&nbsp; Well, the next thing I've been hearing everywhere is on that survey list too:&nbsp; better administration and monitoring tools.&nbsp; You really need a whole open-source stack to monitor PostgreSQL right now, from OS+database trending to query log analysis.&nbsp; It's fine for these tools to live outside the database core, but some changes are clearly needed to make such tools easier to write.&nbsp; For example, the one built-in tool that allows query monitoring is pg_stat_statements, and the limitations preventing it from being useful to most people are so obvious we've gotten <a href="https://commitfest.postgresql.org/action/patch_view?id=681">two</a> <a href="https://commitfest.postgresql.org/action/patch_view?id=693">submissions</a> to improve it in the last month.<br /><br />There are a few projects that aim at the monitoring/administration problem.&nbsp; EnterpriseDB's PostgreSQL Enterprise Manager, Cybertec's pgwatch, OmniTI's Reconnoiter, the suite of smaller tools from End Point, and even the text UI of pg_statsinfo all hit the edges of this problem.&nbsp; What I hear when I have my advocacy hat on is that the community needs a major open-source project bigger than any of these to make database monitoring easier.&nbsp; That's now one of the major 
distinguishing features the commercial competition has.&nbsp; Getting enough of the people developing in this area all pointing in the same direction and working together is a big challenge though.<br /><br />On a related note, now that the underlying features are there, it seems making replication easier to monitor and set up is a major issue too.&nbsp; There are so many choices in replication technologies available for PostgreSQL it's easy for new people to get overwhelmed by them all.&nbsp; And the documentation guides around this area are still filled with a lot of complications that aren't even really necessary to get started at this point.&nbsp; It's easy for newcomers to get dragged into details like how old style archiving works as a precursor to setting up even basic replication, despite the fact that they're using the easier features in the current PostgreSQL instead.&nbsp; This area still has some work in the core database happening in 9.2, and it will be important for the community to create replication guides that include current information covering both 9.1 and that release.&nbsp; What I hear from every country I visit now is "I need material to help me compete against the idea of using Oracle RAC".<br /><br />The last of the global trends that have really jumped out at me is how companies everywhere are reinventing the development process around database applications.&nbsp; In some places, mostly bigger companies and government installations in particular, the expected staff "stack" is business as usual; it hasn't changed in a long time.&nbsp; New applications go from Developer to DBA to systems administrator.&nbsp; Management ideas like DevOps are catching on to improve the interface between these roles, but don't really upset its basic structure.&nbsp; Everywhere I go now, I'm seeing everything but the developer role being squeezed out.&nbsp; ORM-driven development is eliminating the DBA's role in database design.&nbsp; Managed application hosting platforms are 
wiping out the systems administrator role.&nbsp; Startups with an idea for a web application go right from developer to deployment, and happily this is increasingly happening with a PostgreSQL backend in the database role.&nbsp; There isn't even the perception that DBA-like help might be needed until the application grows quite a bit.&nbsp; I'm seeing the need for better database specific optimization skills than a typical developer has being deferred until the application has tens of gigabytes of data to sling around.<br /><br />Being able to deploy small PostgreSQL installs and grow them to a reasonable size without specialized DBA knowledge is a great thing as far as I'm concerned.&nbsp; The exact advances in things like ORMs that have allowed reaching this point across the world are a topic that deserves its own long discussion.&nbsp; I'm going to cut this off here and return to that later.&nbsp; In this country, there's some concrete work around the 9.2 release that needs to get done this month.<br /> ]]>
        
    </content>
</entry>

<entry>
    <title>Performing ETL using Kettle with GPFDIST and GPLOAD</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/11/performing-etl-with-kettle-greenplum-gpfdist-gpload.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.181</id>

    <published>2011-11-07T09:30:00Z</published>
    <updated>2011-11-07T16:38:15Z</updated>

    <summary>Scenario:We have a remote datasource, served by a gpfdist server. We need to import the data in a Greenplum database, while performing some ETL manipulation during the import. It is possible to accomplish this goal with a simple transformation in...</summary>
    <author>
        <name>Giulio Calacoci</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="ETL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplumandkettle" label="Greenplum and Kettle" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<h2>Scenario:</h2><p>We have a remote datasource, served by a gpfdist server. We need to import the data into a Greenplum database, while performing some ETL manipulation during the import.
<br />It is possible to accomplish this goal with a simple transformation in a few steps using Kettle.</p> ]]>
        <![CDATA[<p>For the sake of simplicity, I assume that:</p>
        <ul>
            <li>You know Kettle basics, like connections, jobs, and transformations - for further information you can look at <a href="http://blog.2ndquadrant.com/en/2011/10/etl-with-kettle-on-greenplum-setup.html">my previous article on Kettle and Greenplum</a><br /></li>
            <li>You have a very simple test database on Greenplum, with one table. For the purposes of this article I used a small table called "<i>test</i>" with two columns "<i>id</i>" and "<i>md5</i>".</li>
            <li>You have a GPFDIST server running, with a CSV file exposed.</li>
            <li>You have GPLOAD installed on your machine. You can download it for your architecture from the Greenplum site: http://www.greenplum.com/community/downloads/</li>
        </ul>
        <h2>Using external tables</h2>
        <p>With Greenplum's external tables and parallel file server, gpfdist, efficient data loads can be achieved.</p>
        <p>Using the "<i>SqlScript"</i> component, we can create an external table at the beginning of our transformation.</p>
        <p>The "<i>Create External Table</i>" step, shown below, creates an external table named "<i>external_samples</i>".&nbsp; Data is provided by one location in our case, but two or more locations on the same ETL server can be used. One instance of gpfdist is running on this server, on port 8081. Double-click on the "<i>SQLScript</i>" component and insert the code for the creation of the external table. For example, this is our script:</p>
        <pre>drop external table if exists "public".external_samples;
create external table "public".external_samples( 
id int , 
md5 text 
) 
location('gpfdist://bravo01:8081/testdata.txt')
format 'TEXT' (DELIMITER ',');</pre>
        <p>This is a simple test table, with two columns:</p>
        <ul>
            <li>id</li>
            <li>md5</li> 
        </ul>
        <span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/sql-external.html" onclick="window.open('http://blog.2ndquadrant.com/en/sql-external.html','popup','width=968,height=587,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/sql-external-thumb-600x363.png" alt="sql-external.PNG" class="mt-image-none" height="363" width="600" /></a></span>
        <p><br /></p><p>Because of the "<b>DELIMITER</b>" keyword at the end of the SQL script, Greenplum will use the ',' character as the column delimiter, as in a CSV file.</p>
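        <p>If you do not yet have a data file for gpfdist to serve, a small Python sketch like the one below can generate one in the comma-delimited layout described above. It is only an illustration: the function names are made up for this example, and the choice of the MD5 of each id as the second column is an assumption that simply matches the "<i>id</i>" and "<i>md5</i>" columns of the test table.</p>

```python
import hashlib


def make_rows(count):
    """Return a list of 'id,md5' lines, one for each id from 1 to count."""
    rows = []
    for i in range(1, count + 1):
        # Assumption: the md5 column holds the MD5 hex digest of the id itself.
        digest = hashlib.md5(str(i).encode("utf-8")).hexdigest()
        rows.append(str(i) + "," + digest)
    return rows


def write_test_file(path, count):
    """Write the rows to path, one per line, using ',' as the column delimiter."""
    with open(path, "w") as handle:
        for row in make_rows(count):
            handle.write(row + "\n")


if __name__ == "__main__":
    # Print a few sample rows; call write_test_file("testdata.txt", 1000)
    # to create a file with the name used in the location clause.
    for row in make_rows(3):
        print(row)
```

        <p>The resulting file has to be placed in the directory that your gpfdist instance exposes.</p>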
        <p>Now we have a read-only table. Drag a "Table input" component and drop it in the design panel. Connect the script component to the "Table input" component, and double-click on the latter.</p>
        <p>Insert the select statement by specifying the columns you wish to import. For example:</p>
        <pre>SELECT id,md5 FROM external_samples;</pre>
        <p>This will perform the extraction of the data from the external table.</p><span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/table-input.html" onclick="window.open('http://blog.2ndquadrant.com/en/table-input.html','popup','width=1093,height=556,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/table-input-thumb-600x305.png" alt="table-input.PNG" class="mt-image-none" height="305" width="600" /></a></span>
        <p><br /></p><p>This stream of data can be used for some manipulations. For the sake of simplicity, no ETL operations will be performed in this article (for an introduction to ETL with Greenplum and Kettle, you can read my previous article: "<i><a href="http://blog.2ndquadrant.com/en/2011/10/etl-with-kettle-on-greenplum-setup.html">ETL with Kettle and Greenplum</a></i>"), so we can focus on the use of the gpfdist and gpload components.</p>
        <h2>Using the Greenplum Load component</h2>
        <p>From the bulk loading folder,  inside the Design tab on the left side of the Kettle window, drag a "<i>Greenplum Load"</i> component, and drop it near the "<i>Table Input</i>" element. Then connect the two elements.</p><p><a href="http://blog.2ndquadrant.com/en/bulkload.html" onclick="window.open('http://blog.2ndquadrant.com/en/bulkload.html','popup','width=266,height=216,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/bulkload-thumb-266x216.png" alt="bulkload.PNG" class="mt-image-none" height="216" width="266" /></a></p>
        <p>Now double-click on the "<i>Greenplum Load</i>" component.</p>
        <p>Choose, as usual, the connection that has to be used.</p><span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/gpload-conf.html" onclick="window.open('http://blog.2ndquadrant.com/en/gpload-conf.html','popup','width=811,height=489,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/gpload-conf-thumb-600x361.png" alt="gpload-conf.PNG" class="mt-image-none" height="361" width="600" /></a></span>
  
        <p><strong>IMPORTANT</strong></p>
        <p><b>Due to a bug, if you fill in the "target schema" field, the component will generate a configuration file with the wrong format. Leave the schema field empty, and fill in the "Target Table" field using the &lt;schema&gt;.&lt;table&gt; syntax.</b></p>
        <p>We are almost there.</p>
  
        <p>We need to map the table fields to the stream fields. However, the external table that we want to create does not exist on the server yet, so we have to map them manually.</p>
        
        <p>On the "<i>localhost names</i>" panel you have to specify which port number the gpfdist file distribution program uses - typically 8081 works fine. Be sure to choose a free port. You also have to insert the IP address of the machine running gpload (usually it is localhost or 127.0.0.1). If the host has several NICs, then the host name or IP address of each NIC can be specified.</p>
        <span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/Schermata%202011-10-20%20a%2021.14.25.html" onclick="window.open('http://blog.2ndquadrant.com/en/Schermata%202011-10-20%20a%2021.14.25.html','popup','width=809,height=314,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/Schermata%202011-10-20%20a%2021.14.25-thumb-600x232.png" alt="Schermata 2011-10-20 a 21.14.25.PNG" class="mt-image-none" height="232" width="600" /></a></span>
        <p>In the "<i>GP Configuration"</i> tab you can insert the path to where the GPload utility is installed. It is possible to define the name of the GPload control file that will be generated and the name of the Data file(s) that will be written for subsequent load operations in the target tables by GPload (you can also specify the encoding and the file column delimiter).</p>
        <span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/gp-conf-spec.html" onclick="window.open('http://blog.2ndquadrant.com/en/gp-conf-spec.html','popup','width=815,height=315,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/gp-conf-spec-thumb-600x231.png" alt="gp-conf-spec.PNG" class="mt-image-none" height="231" width="600" /></a></span>
        <p><br /></p><p>Once all the elements are configured and connected, your transformation should look like this:&nbsp;</p><span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/connection-order.html" onclick="window.open('http://blog.2ndquadrant.com/en/connection-order.html','popup','width=408,height=137,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/connection-order-thumb-408x137.png" alt="connection-order.PNG" class="mt-image-none" height="137" width="408" /></a></span><p>Execute the transformation. Everything should work fine, and the data will be imported in the destination table.</p>]]>
    </content>
</entry>

<entry>
    <title>Mapreduce in Greenplum 4.1</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.180</id>

    <published>2011-10-31T13:46:11Z</published>
    <updated>2011-11-02T11:05:06Z</updated>

    <summary>Mapreduce is a very trendy software framework. It was introduced by Google (TM) in 2004. It is a large topic, and it is not possible to cover all of its aspects in a single blog article. This is a...</summary>
    <author>
        <name>Carlo Ascani</name>
        <uri>http://www.2ndQuadrant.it/</uri>
    </author>
    
        <category term="Greenplum" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="greenplumcommunityeditionandmapreduce" label="Greenplum Community Edition and MapReduce" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[<p>Mapreduce is a very trendy software framework. It was introduced by Google (TM) in 2004.
It is a large topic, and it is not possible to cover all of its aspects in a single blog article.
This is a simple introduction to the <em>mapreduce</em> usage in Greenplum 4.1.</p>
]]>
        <![CDATA[<h2>What is mapreduce exactly?</h2>

<p>Mapreduce's main goal is to process highly distributable problems across huge datasets using a large number of computers (nodes).
As you may guess, this is a perfect fit for Greenplum, which is at ease with huge distributed datasets and allows integration
with the SQL language.</p>

<p>Mapreduce consists of two separate steps: <em>Map</em> and <em>Reduce</em>.</p>

<h3>Map step</h3>

<p>During this step, the main problem is partitioned into smaller sub-problems that are passed to children nodes, recursively.
This process leads to a multi-level tree structure.</p>

<h3>Reduce step</h3>

<p>During this step, all the sub-problem solutions are merged to obtain the solution to the initial problem.</p>

<h2>Install Mapreduce in Greenplum</h2>

<p>Do you think installing Mapreduce in Greenplum is a difficult task? The answer is no. Mapreduce is already included in Greenplum!</p>

<h2>Let's get practical</h2>

<p>I assume that you have a Greenplum 4.1 system installed.</p>

<p>Run <code>gpmapreduce --version</code>:</p>

<pre><code>$ gpmapreduce --version
gpmapreduce - Greenplum Map/Reduce Driver 1.00b2
</code></pre>

<p>Perfect, we can go on.</p>

<p>The main point here is a specially formatted file, which we will conventionally call <code>test.yml</code> from now on.</p>

<p>As you may guess, that is a <code>YAML</code> file, which defines all parts that are needed by the mapreduce data flow to complete:</p>

<ul>
<li>Input Data</li>
<li>Map Function</li>
<li>Reduce Function</li>
<li>Output Data</li>
</ul>

<p>The Greenplum MapReduce specification file has a specific YAML schema. I invite you to have a look at the AdminGuide for details.
In particular, MapReduce is handled in Chapter 23.</p>

<p>For the sake of this article, we will focus on function definitions.</p>

<p>Let's start writing a text file named <code>test.yml</code> with the mandatory header:</p>

<pre><code>%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: dbname
USER: gpadmin
HOST: host
</code></pre>

<p>where <code>dbname</code> and <code>host</code> are the name of the database and the host that MapReduce will connect to.</p>

<h3>Input Data</h3>

<p>Input Data can be obtained in many ways; in this example we will use an SQL <code>SELECT</code> statement.
Let's create a table in database <code>dbname</code> to get data from:</p>

<pre><code>$ psql -c "CREATE TABLE mydata AS SELECT i AS x,
     floor(random()*100) AS y FROM generate_series(1,5) i" dbname
</code></pre>

<p>This will create a five-row table with this structure:</p>

<pre><code>dbname=# \d mydata
         Table "public.mydata"
 Column |       Type       | Modifiers 
--------+------------------+-----------
 x      | integer          | 
 y      | double precision | 
Distributed by: (x)
</code></pre>

<p>The set of rows of this table is our Input Data.
Let's define it in the MapReduce configuration file, by appending this to <code>test.yml</code>:</p>

<pre><code>DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM mydata
</code></pre>

<p>That is self-explanatory: it just selects all rows from the <code>mydata</code> table as input data for mapreduce.</p>

<h3>Map Function</h3>

<p>It is very important to understand that a Map function takes as input <em>a single row</em>, and produces <em>zero or more</em> output rows.
Map functions can be written in C, Perl or Python.
They reside directly in the YAML configuration file.</p>

<p>Parameter management varies between programming languages (please consult the AdminGuide for details).
Let's see an example of a map function written in Python. You can append the following to <code>test.yml</code>:</p>

<pre><code>    - MAP:
        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [x integer, y float]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }
</code></pre>

<p>As you can see, the function source is placed directly in the YAML configuration file.
The function takes <code>x</code> and <code>y</code> as input and returns (via <code>yield</code>) two rows for each input row: one carrying <code>x</code> under the key <code>Sum of x</code>, and one carrying <code>y</code> under the key <code>Sum of y</code>.</p>
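
<p>To see what the map step emits before running it inside Greenplum, the function body can be tried as an ordinary Python generator. This is only a local stand-in: the wrapper <code>def</code> and the sample rows are assumptions made for illustration, while in a real job Greenplum supplies <code>x</code> and <code>y</code> from each input row.</p>

```python
def my_map_function(x, y):
    """Plain-Python stand-in for the MAP body: two output rows per input row."""
    yield {'key': 'Sum of x', 'value': x}
    yield {'key': 'Sum of y', 'value': y}


if __name__ == "__main__":
    # Hypothetical input rows (x, y); real values of y come from random().
    sample_rows = [(1, 42.0), (2, 7.0)]
    for x, y in sample_rows:
        for out in my_map_function(x, y):
            print(out['key'], out['value'])
```

<p>Each input row produces exactly two (key, value) rows, so two input rows give four output rows.</p>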

<h3>The Reduce step</h3>

<p>A Reduce function takes a set of rows as input and produces <em>a single</em> reduced row.
There are several predefined functions included in Greenplum.</p>

<p>Here's the list:</p>

<ul>
<li>IDENTITY - returns (key, value) pairs unchanged</li>
<li>SUM - calculates the sum of numeric data</li>
<li>AVG - calculates the average of numeric data</li>
<li>COUNT - calculates the count of input data</li>
<li>MIN - calculates minimum value of numeric data</li>
<li>MAX - calculates maximum value of numeric data</li>
</ul>

<p>Let's apply a REDUCE function to our input data, so append this to <code>test.yml</code>:</p>

<pre><code>EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM
</code></pre>

<p>This sums the values for each key. It is not very useful in practice, but it is enough to show the Reduce step in action and get you started.</p>
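
<p>Conceptually, the predefined <code>SUM</code> reducer groups the rows emitted by the map step by key and adds up their values. The sketch below imitates that in plain Python. The name <code>reduce_sum</code> and the sample pairs are hypothetical and are not part of any Greenplum API; the sketch only illustrates the idea.</p>

```python
from collections import defaultdict


def reduce_sum(pairs):
    """Group (key, value) pairs by key and add up the values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical map output for two input rows.
    mapped = [
        ("Sum of x", 1), ("Sum of y", 42.0),
        ("Sum of x", 2), ("Sum of y", 7.0),
    ]
    for key, total in sorted(reduce_sum(mapped).items()):
        print(key, total)
```

<p>However many rows the map step emits, the result contains one row per distinct key.</p>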

<p>Ok, let's see the complete <code>test.yml</code>:</p>

<pre><code>%YAML 1.1
---

VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
    - INPUT:
        NAME:  my_input_data
        QUERY: SELECT x,y FROM mydata

    - MAP:

        NAME: my_map_function
        LANGUAGE: PYTHON
        PARAMETERS: [ x integer , y float ]
        RETURNS: [key text, value float]
        FUNCTION: |
                yield {'key': 'Sum of x', 'value': x }
                yield {'key': 'Sum of y', 'value': y }

EXECUTE:
    - RUN:
        SOURCE: my_input_data
        MAP: my_map_function
        REDUCE: SUM

</code></pre>

<p>Remember that YAML does not use TABS!</p>
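
<p>Since a stray TAB is easy to miss in an editor, a tiny helper like the following can point to the offending lines. It is just a convenience sketch with a made-up function name, and it is not part of the Greenplum tooling.</p>

```python
def find_tab_lines(text):
    """Return the 1-based numbers of the lines that contain a TAB character."""
    return [number for number, line in enumerate(text.splitlines(), 1)
            if "\t" in line]


if __name__ == "__main__":
    # In-memory sample: the third line is indented with a TAB on purpose.
    sample = "DEFINE:\n    - INPUT:\n\tNAME: my_input_data\n"
    print(find_tab_lines(sample))
```

<p>To check your own file, pass the content of <code>test.yml</code> to the function, for example with <code>find_tab_lines(open("test.yml").read())</code>.</p>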

<p>It is now possible to execute this Mapreduce job simply by running:</p>

<pre><code>$ gpmapreduce -f test.yml
</code></pre>

<p>Results here will most likely be different from yours, due to the usage of the <code>random()</code> function during data generation.
Here's mine:</p>

<pre><code>mapreduce_2508_run_1
key     |value
--------+-----
Sum of x|   15
Sum of y|  278
(2 rows)

</code></pre>

<p>These are exactly the sums of all x and y values from the input table <code>mydata</code>.</p>

<p>In conclusion, this is just a smattering of how MapReduce works in Greenplum.
MapReduce is a complex and wide topic, and its usage is growing in popularity every day.</p>

<p>Greenplum has excellent support for it and allows business analytics users to take advantage
of the shared nothing architecture by executing map/reduce functions in a distributed way and by
working on distributed datasets.</p>
]]>
    </content>
</entry>

<entry>
    <title>Apple&apos;s Lossless Audio Codec and Software Patents</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/10/apples-lossless-audio-codec-an.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.179</id>

    <published>2011-10-28T12:07:45Z</published>
    <updated>2011-10-28T12:17:33Z</updated>

    <summary> Today my mailbox was crowded with some Apple news. The source code to the Apple Lossless Audio Codec, encoders and decoders, was released to the world. A few open-source projects such as FFmpeg and FAAC have already had support...</summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresqlapplepatents" label="postgresql apple patents" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[


	
	
	

<p style="margin-bottom: 0in">Today my mailbox was crowded with some
Apple news.  The source code to the Apple Lossless Audio Codec, encoders and decoders, was <a href="http://alac.macosforge.org/">released to
the world</a>.  A few open-source projects such as FFmpeg and FAAC have
already had support for the resulting .m4a file; this gives them an
official release to validate against.  As a well documented audio
snob, I'm known to be a fan of lossless audio, with a few hundred CDs
here ripped into FLAC.  Unfortunately, this source code drop doesn't
change anything for me, and that's all because of software patents. 
Right now I consider such patents to be the greatest risk to software
I work on like PostgreSQL, so I spent some time looking into just
what Apple has done here, as a data point on that topic.</p>

<p style="margin-bottom: 0in">If your baseline is MP3 files, Apple
Lossless appears to be a step forward as far as licensing goes.  All
MP3 playback requires a patent license, while playback of Apple's
format does not--only the encoding side need be licensed.  That's
not my baseline though.  FLAC has specifically avoided using any
known patented technology, so it's the clear winner in being clean of
patent issues.</p>
<p style="margin-bottom: 0in">You can't even read the just released source code without being shown the door that leads to Apple's patents.&nbsp; <a href="http://alac.macosforge.org/trac/browser/trunk/codec/ALACEncoder.cpp">ALACEncoder.cpp</a> pushes you that way when you read it, saying "The
relevance of the ALAC coefficients is explained in detail in patent
documents."  So you can't fully understand what this code does
unless you dip into the patent description.  That's a big sign of
trouble.&nbsp; I'm not sure exactly which patent they are referring to; it may be <a href="http://www.freepatentsonline.com/y2008/0027709.html">Determining scale factor values in encoding audio data with AAC</a>.
</p>

<p style="margin-bottom: 0in">I'm not sure because I don't read patent descriptions if I
can avoid it, due to how&nbsp;<a href="http://en.wikipedia.org/wiki/Treble_damages">damages are tripled</a> with willful infringement.  That rule puts
open-source developers in a weird place.  The act of researching
which patents your free implementation might infringe on can have a
wildly negative return on investment.  If it's impossible to
implement the idea without infringing on the patents you find, which
can easily be the case given the ridiculously low bar for granting
such patents, if a lawsuit does happen you'd be better off not
knowing about that risk when it starts.  On the PostgreSQL mailing
lists, mentioning how a patented implementation of something works is
one of the few things that will get you a warning and potentially
blocked from the lists.  Knowingly implementing a patented idea can
be a very expensive mistake for the project to make.</p>

<p style="margin-bottom: 0in">Now, you can claim this problem has
gone away for this bit of software due to how Apple has released this
code under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache 2.0 license</a>.  If you build something using the Apple Lossless
code, and you distribute the result under the Apache 2.0 license, you
get the protection of its Grant of Patent License.  You'll only lose
it if you sue someone in a way that involves this code+patent
combination.  That doesn't necessarily protect manufacturers of
hardware who want to include this capability, as they may not want to
license the result that way.  I don't really care about them though. 
People who want to take advantage of open-source code but not
contribute back can worry about their own legal issues; they're
not my concern.</p>

<p style="margin-bottom: 0in">My real issue with using this format is
that it implicitly approves of business practices around software
patents I find hostile to my own work, and that leads me back to the
patent-less FLAC again.  Apple used to more often be the victim of
valid and frivolous patent claims, but increasingly they've become
the originator of them instead.  In mid 2006, Apple was sued over
patent issues by Creative Technology; they settled paying Creative
$100M.  Since then, patent issues have increasingly been part
of their <a href="http://en.wikipedia.org/wiki/Apple_Inc._litigation#Patent_infringement">well exercised legal arm</a>.</p>

<p style="margin-bottom: 0in">Starting last year, Apple has seriously
turned around in how it handles patents--it's now aggressively
enforcing its patents rather than just being sued for violations.  Steve
Jobs issued a warning shot about that, saying "competitors should
create their own original technology, not steal ours". 
Considering it had only been four years before then that Apple was
sued--and paid out in a big way--for stealing other people's
patented ideas, that came off as rather hypocritical to me.  
</p>

<p style="margin-bottom: 0in">The company has done well staying ahead
of its competitors through product innovation, rather than court
fights.  iPod competitors failed to gain traction because their
products weren't as good, not because they couldn't steal the design.
 Apple's iPad should have the same property; the knock-offs are not
selling because they're not as good.  The only real threat to their
product line right now is how Android phones are displacing the
iPhone.  Dumping so much money into offensive lawsuits is burning up
money that could be used for real product advances there instead. 
It's a shame they've resorted to this tactic.</p>

<p style="margin-bottom: 0in">At this point, a cautious person would
avoid using technology encumbered by Apple's patents, as they clearly
have the means and intent to sue for violations.  And someone who
values open source projects should avoid patented approaches even
when they are freely licensed.  Whether or not individuals using
Apple Lossless in particular are exposed to problems here is missing
the big picture.</p>

<p style="margin-bottom: 0in">As an advocate for free software, I
reflexively pick the less patent encumbered approach to any problem,
using that as the tie breaker for decisions that are otherwise even. 
Encoding into and helping popularize Apple Lossless may be legal now.
 But I'll keep encoding into FLAC, copying the result onto my Sansa
Fuse player, and avoiding their entire music ecosystem.  Software
patents are too dangerous to implicitly endorse if that can be
avoided, and here it easily can.</p>

 ]]>
        
    </content>
</entry>

<entry>
    <title>Sync rep scaling</title>
    <link rel="alternate" type="text/html" href="http://blog.2ndquadrant.com/en/2011/10/sync-rep-scaling.html" />
    <id>tag:blog.2ndquadrant.com,2011:/en//3.177</id>

    <published>2011-10-26T16:31:04Z</published>
    <updated>2011-10-26T16:38:51Z</updated>

    <summary><![CDATA[I'm almost done with this year's crazy conference season schedule, just Brazil's PG.BR next week left.&nbsp; All of my recent presentations are now available at our talks page.&nbsp; You'll also find many of the talks from our CHAR(11) conference this...]]></summary>
    <author>
        <name>Greg Smith</name>
        <uri>http://www.2ndQuadrant.us/</uri>
    </author>
    
        <category term="Greg&apos;s PlanetPostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="PostgreSQL" scheme="http://www.sixapart.com/ns/types#category" />
    
    <category term="postgresqlreplication" label="postgresql replication" scheme="http://www.sixapart.com/ns/types#tag" />
    
    <content type="html" xml:lang="en" xml:base="http://blog.2ndquadrant.com/en/">
        <![CDATA[I'm almost done with this year's crazy conference season schedule, just Brazil's <a href="http://pgbr.postgresql.org.br/">PG.BR</a> next week left.&nbsp; All of my recent presentations are now available at our <a href="http://www.2ndquadrant.com/en/talks/">talks page</a>.&nbsp; You'll also find many of the talks from our CHAR(11) conference this summer there for the first time.&nbsp; Those were unavailable for a while due to an unfortunately timed web site change.<br /><br />I like to do some original research for my talks, and this year that included a look into <a href="http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/SyncRepDurability.pdf">Synchronous Replication and Durability Tuning</a> in PostgreSQL 9.1, specifically the performance side.&nbsp; At last week's PG.EU I gave an updated version of this talk (with Simon Riggs), including a bit more info than I had available during the Postgres Open presentation on the same topic.<br /><span class="mt-enclosure mt-enclosure-image" style="display: inline;"><a href="http://blog.2ndquadrant.com/en/clients-3.html" onclick="window.open('http://blog.2ndquadrant.com/en/clients-3.html','popup','width=640,height=480,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0'); return false"><img src="http://blog.2ndquadrant.com/en/clients-3-thumb-320x240.png" alt="clients-3.png" class="mt-image-none" style="" height="240" width="320" /></a></span><br />The good news/bad news performance results are nicely summarized by the "Group commit magic" graph on slide 17.&nbsp; There I was replicating across the Atlantic Ocean, crossing from my home here in Baltimore over to the conference site in Amsterdam.&nbsp; When doing synchronous replication, the speed of light determines how fast any single client can commit.&nbsp; You can only reach about 50% of that speed over a round trip given current (and expected future) technology here.&nbsp; The efficiency rate of the 
fiber-optic cables used is not perfect, and tasks like routing add some overhead too.
<br />&nbsp;<br />With that sort of distance, the round trip time is at least 100 milliseconds.&nbsp; What this means is that no one client can commit more than 10 times per second.&nbsp; This shows just how difficult sync rep is to run well: the maximum rate drops fast if you expect to leave your local data center to commit.&nbsp; Light turns out to be really slow compared with what many people expect.
<br />&nbsp;<br />The great thing I proved in that slide is that the efficiency when multiple clients are committing at the same time scales almost linearly.&nbsp; If you want 1000 transactions/second over this sort of distance, you can get that--but you'll need just over 125 clients going at once to do it.&nbsp; Each one of those clients will be seeing 10 commits/second and 100 millisecond latencies (and higher for a small percentage, peak latency in the test was closer to 600ms).&nbsp; But each commit reply will be acknowledging a pile of clients at once, just by sending a small packet with an updated "committed up to this point" response.
<br />&nbsp;<br />When the speed of light turns out to be your bottleneck, there's not much you can do to attack that directly.&nbsp; Some people who want to reach higher rates might architect their systems with many smaller clients, as I've shown in this example.&nbsp; I was able to reach my goal of 2000 INSERT commits/second here, but it took 275 clients to get there.
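<br />&nbsp;<br />The arithmetic behind those numbers can be sketched in a few lines of Python.&nbsp; This is only a back-of-the-envelope model with made-up function names: it assumes each client waits one full round trip per commit and that throughput scales perfectly linearly with the number of clients, which the measured results only approximate.

```python
def max_commits_per_client(round_trip_ms):
    """Upper bound on commits/second for one client that waits a full round trip per commit."""
    return 1000.0 / round_trip_ms


def ideal_clients_needed(target_tps, round_trip_ms):
    """Clients needed if throughput scaled perfectly linearly with client count."""
    return target_tps / max_commits_per_client(round_trip_ms)


if __name__ == "__main__":
    rtt_ms = 100.0  # round trip time quoted in the text
    print(max_commits_per_client(rtt_ms))      # 10.0
    print(ideal_clients_needed(1000, rtt_ms))  # 100.0
    print(ideal_clients_needed(2000, rtt_ms))  # 200.0
```

Comparing the ideal counts of 100 and 200 clients with the measured 125 and 275 suggests the scaling was roughly 80% and 73% efficient at those two rates.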
<br />&nbsp;<br />The other great thing about how PostgreSQL implements sync rep is that it's controlled per transaction.&nbsp; Once you see how expensive the commits are, if that's too high for some of your data, you can always tweak that for higher performance just by disabling sync rep for some transactions.&nbsp; Having such fine-grained control over synchronous commits is a feature unique to PostgreSQL, allowing something like a Quality of Service suggestion to the database.&nbsp; The PostgreSQL code as of 9.1 really has an unprecedented range of trade-offs here.&nbsp; You can go for faster but not very durable at all (with unlogged tables), locally durable but not guaranteed to a remote data center at a medium speed, all the way up to fully synchronous and very expensive to commit.&nbsp; It's possible to argue that other database choices are better at one end of this range.&nbsp; You might use MongoDB for higher speeds at the low durability range, and Oracle for their better tested sync rep capabilities (I say better tested simply because the PostgreSQL 9.1 code is very new relative to Oracle's implementation).<br /><br />PostgreSQL started in the middle here, and with 9.1 it's expanded nicely toward both ends of the spectrum at once.&nbsp; It's now providing options for higher and lower durability at the same time, in one database, and with the speed/durability trade-off adjustable for every transaction.&nbsp; Building one size fits all software is really hard, and the new features in 9.1 nicely push out capabilities here for several popular use cases, all at the same time, and only when you want to pay for them.<br />]]>
        
    </content>
</entry>

</feed>

