2ndQuadrant, Professional PostgreSQL

2ndQuadrant Ltd official blog

JDBC is the driver used to access a database with Java. Greenplum has a full working JDBC implementation. In this short article we'll see how to use it.

With an announce on the forum, Greenplum staff has spoke out about the new version of their Database Management System. I can't resist to blog about some of its new features.

In the previous article we have seen how to install Greenplum on multiple nodes. After installation steps, we must init the entire system. Let's see how.

One of the main advantages using Greenplum is that it gains power when it uses multiple nodes. Horizontal scalability is a main feature of Greenplum.

Here is a compact handbook to install a multi-node Data Warehouse environment with Greenplum.

The fifth edition of the Italian PGDay went well beyond our initial expectations. We had about 75 participants, a total of 95 people including staff and speakers.
As I said during the event, rather than PGDay Italy, this should be named PGDay for Italian speakers given the presence of staff from Switzerland (Canton Ticino). Participants came from 12 regions: all regions but Val d'Aosta in the north/centre area, but also from Southern Italy (Naples and Calabria).

Greenplum does not officially support Ubuntu Server 11.10 as underlying operating system. However, I needed to install it on the most recent Ubuntu server just to perform some tests and evaluate it.

Through this article, we are going to complete the MapReduce job started in the previous article.

This year's conference lineup led me all over the world, a giant rectangle triangle going from the west coast of the US, north to Canada, east to the UK and Amsterdam, then ending south in Brazil.  I've now locked myself in to focus on the 3rd CommitFest for PostgreSQL 9.2, which began a few days ago.  Check out the 2011-11 section of the CommitFest tracker to see what changes have been submitted, we're always looking for new volunteer patch reviewers.

Talking to people deploying PostgreSQL in several countries during a short span of time has given me some interesting perspective on where the project is at.  I follow a lot of adoption trends in the US, and some of those I assume are quirks in how business is done in this country.  But when I hear the same sort of feedback from people in all four of the countries I've been to this year, too, it's clear this is a larger issue.

The first thing I'm seeing a surprising amount of is satisfaction with the feature set in PostgreSQL.  A few years ago, conversations about what you could and couldn't do with PostgreSQL usually stalled on one of a few common requests.  There's a good survey of PostgreSQL feature feedback at User Voice.  13 important features originally on that list have been closed already, with Index Only scans as the next expected to fall in the upcoming 9.2.  PostgreSQL now includes regular and synchronous replication as of 9.1.  pg_upgrade has been getting an increasing amount of testing that proves it works for many in-place upgrade scenarios.  Extensions are dramatically easier to use now.

It seems the total feature set has crossed the threshold where PostgreSQL is good enough for a whole lot more deployment situations than it used to be.  What I'm hearing from people all over the world now is that the basic feature set and performance of PostgreSQL isn't failing the "checkbox test" so often anymore, where business people require certain things before they'll even consider a database.  There are some major wants that are some distance off, such as materialized views and better OLAP support (cube/rollup/etc.)  And using partitioning for bigger data sets is harder than people would like.  But these are all things that are only needed for larger deployments, and some workarounds exist if you're willing to work at them.

If the feature set isn't holding back as many deployments now, what is?  Well, the next thing I've been hearing everywhere is on that survey list too:  better administration and monitoring tools.  You really need a whole open-source stack to monitor PostgreSQL right now, from OS+database trending to query log analysis.  It's fine for these tools to live outside the database core, but some changes are clearly needed to make such tools easier to write.  For example, the one built-in tool that allows query monitoring is pg_stat_statements, and the limitations preventing it from being useful to most people are so obvious we've gotten two submissions to improve it in the last month.

There are a few projects that aim at the monitoring/administration problem.  EnterpriseDB's PostgreSQL Enterprise Manager, Cybertec's pgwatch, OmniTI's Reconnoiter, the suite of smaller tools from End Point, and even the text UI of pg_statsinfo all hit the edges of this problem.  What I hear when I have my advocacy hat on is that the community needs a major open-source project bigger than any of these to make database monitoring easier.  That's now one of the major distinguishing features the commercial competition has.  Getting enough of the people developing in this area all pointing in the same direction and working together is a big challenge though.

On a related note, now that the underlying features are there, it seem making replication easier to monitor and setup is a major issue too.  There are so many choices in replication technologies available for PostgreSQL it's easy for new people to get overwhelmed by them all.  And the documentation guides around this area are still filled with a lot of complications that aren't even really necessary to get started at this point.  It's easy for newcomers get dragged into details like how old style archiving works as a precusor to setting up even basic replication, despite that they're using the easier features in the current PostgreSQL instead.  This area still has some work in the core database happening in 9.2, and it will be important for the community to create replication guides that include current information covering both 9.1 and that release.  What I'm hear from every country I visit now is "I need material to help me compete against the idea of using Oracle RAC".

The last of the global trends that have really jumped out at me is how companies everywhere are reinventing the development process around database applications.  In some places, mostly bigger companies and government installations in particular, the expected staff "stack" is business as usual; it hasn't changed in a long time.  New applications go from Developer to DBA to systems administrator.  Management ideas like DevOps are catching on to improvement interface between these roles, but not really upset its basic structure.  Everywhere I go now, I'm seeing everything but the developer role being squeezed out.  ORM-driven development is eliminating the DBA's role in database design.  Managed application hosting platforms are wiping out the systems administrator role.  Startups with an idea for a web application go right from developer to deployment, and happily this is increasingly happening with a PostgreSQL backend in the database role.  There isn't even the perception that DBA-like help might be needed until the application grows quite a bit.  I'm seeing the need for better database specific optimization skills than a typical developer has being deferred until the application has tens of gigabytes of data to sling around.

Being able to deploy small PostgreSQL installs and grow them to a reasonable size without specialized DBA knowledge is a great thing as far as I'm concerned.  The exact advances in things like ORMs that have allowed reaching this point across the world are a topic that deserves its own long discussion.  I'm going to cut this off here and return to that later.  In this country, there's some concrete work around the 9.2 release that needs to get done this month.

Scenario:

We have a remote datasource, served by a gpfdist server. We need to import the data in a Greenplum database, while performing some ETL manipulation during the import.
It is possible to accomplish this goal with a simple transformation in a few steps using Kettle.

Mapreduce is a very trendy software framework. It has been introduced by Google (TM) in 2004. It is a large topic, and it is not possible to cover all of its aspetcs in a single blog article. This is a simple introduction to the mapreduce usage in Greenplum 4.1.