In this article, we are going to complete the MapReduce job started in the [previous article](https://blog.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html).
MapReduce is a very popular software framework, introduced by Google in 2004.
It is a large topic, and it is not possible to cover all of its aspects in a single blog article.
This is a simple introduction to _MapReduce_ usage in Greenplum 4.1.
In this article, I am going to upgrade a Greenplum cluster from version 4.0 to 4.1 using `gpmigrator`.
`gpmigrator` is a utility shipped with Greenplum Community Edition whose purpose is to perform a live upgrade of an existing database.
I'm going to demonstrate how it is possible to use dblink in Greenplum 4.1.
In this article we are going to show you how to write PL/Java functions in Greenplum.
I assume that you have a working Greenplum (or Greenplum Community Edition) at your disposal. In this example we will use version **4.0.4**, installed in /usr/local/greenplum-db-4.0.4.0 (which is the default location).
[*MADlib*](http://madlib.net) is an open-source library for scalable in-database analytics which targets the PostgreSQL and the Greenplum databases.
MADlib version 0.2beta needs to be installed properly to follow this article, so we encourage you to read the [official documentation](https://github.com/madlib/madlib/wiki/Installation-Guide-%28v0.2beta%29) to install it in a Greenplum database.
I'm going to show you how to perform association rule mining using MADlib and Greenplum.
During EuroPython 2011, the major annual event for Python developers and
users in Europe, 2ndQuadrant will deliver a special hands-on training
session entitled "The Python and the Elephant". This 4-hour workshop
will take place on Thursday June 23 and will cover the two main
techniques for writing applications in Python for PostgreSQL: standard
client applications using PsycoPG or internal extensions using the
PL/Python language for stored procedures.
One of the challenges when dealing with a new database design is that you don't know things like how big the tables will end up being until they're actually populated with a fair amount of data. But if the design has to factor in eventual scalability concerns, you can't deploy it to obtain that data until you've made those estimates. One way around this is to aggressively prototype things. Use staging hardware for this purpose, where new applications can live temporarily while you sort out details like this. You can just factor in that you'll need to move the app, and possibly redesign it, after a few months, when you have a better idea what data is going to show up in it.

The other way to get around this chicken/egg problem is to write a data generator. Construct
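A minimal sketch of such a data generator in Python; the column shapes (a name, a timestamp, a balance) are hypothetical, chosen only to illustrate the pattern. A fixed seed makes every run reproducible, and the tab-separated output is suitable for bulk loading with COPY:

```python
import random
import string
from datetime import datetime, timedelta

def random_row(rng, start):
    """Build one synthetic row with a plausible shape (hypothetical columns)."""
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    created = start + timedelta(seconds=rng.randrange(86400 * 365))
    balance = round(rng.uniform(0, 10000), 2)
    return (name, created.isoformat(sep=" "), balance)

def generate(n, seed=42):
    """Yield n rows; the fixed seed makes runs reproducible for benchmarking."""
    rng = random.Random(seed)
    start = datetime(2011, 1, 1)
    for _ in range(n):
        yield random_row(rng, start)

if __name__ == "__main__":
    for row in generate(3):
        print("\t".join(map(str, row)))  # tab-separated, COPY-friendly
```

Because the generator is deterministic, you can rerun the same load at different scales just by changing `n`, which makes size estimates comparable between runs.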
One of the ugly parts of Linux with PostgreSQL is that the OS will
happily cache up to around 5% of memory before getting aggressive about
writing it out. I've just updated a long list of pgbench runs showing
how badly that can turn out, even on a server with a modest 16GB of
RAM. Note that I am intentionally trying to introduce the bad situation
here, so this is not typical performance. The workload that pgbench
generates is not representative of any real-world workload; it's as
write-intensive as it's possible to be.

Check out test set 5, which is running a stock development version of
PostgreSQL 9.1. Some of the pauses where the database is unresponsive
during checkpoints, as shown by the max_latency figure there (which is
in milliseconds), regularly exceed 40 seconds.
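If you want to watch this write-cache buildup yourself, Linux exposes the relevant counters in `/proc/meminfo`. Here is a small sketch (my own helper, not from the pgbench tooling) that extracts the dirty-page figures:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kB} dict for the
    counters that track the Linux write cache."""
    wanted = {"MemTotal", "Dirty", "Writeback"}
    stats = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in wanted:
            stats[key] = int(rest.split()[0])  # values are reported in kB
    return stats

if __name__ == "__main__":
    # On a Linux box, watch the dirty-page count climb during a pgbench run:
    with open("/proc/meminfo") as f:
        stats = parse_meminfo(f.read())
    pct = 100.0 * stats["Dirty"] / stats["MemTotal"]
    print(f"Dirty: {stats['Dirty']} kB ({pct:.2f}% of RAM)")
```

Running this in a loop during a heavy write test shows the "Dirty" figure growing toward the kernel's writeback threshold right before the checkpoint stalls appear.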
One of the more useful bits of
PostgreSQL documentation I ever worked on is Tuning Your PostgreSQL
Server. When that was written in the summer of 2008, a few months after the
release of PostgreSQL 8.3, it was hard to find any similar guide that
was both (relatively) concise and current. Since then, many other
PostgreSQL contributors and I have helped keep that document up
to date as changes to PostgreSQL were made.
The interesting and helpful trend
during that period is that parameters keep disappearing from the set
of ones you need to worry about. In PostgreSQL 8.2, there was a long
list of parameters you likely needed to adjust for good performance
on a PostgreSQL server: shared_buffers, effective_cache_size,
checkpoint_segments, autovacuum, max_fsm_pages,