Greenplum on GCP Updates

Google has added a second generation of instance (machine) types for Greenplum in the GCP Marketplace. Previously, Greenplum was available with the n1-standard-8, n1-highmem-8, n1-standard-16, and n1-highmem-16 instance types, which have 8 and 16 vCPUs. Starting in 2021, the N2 series is also available.


Instance Series | Standard (per vCPU per Month) | Highmem (per vCPU per Month)
N2 Standard is 14.4% and N2 Highmem is 21% more expensive than N1 Standard and Highmem.

Disk Performance

Just like AWS, GCP puts artificial limits on disk throughput based on the instance type and the disk type. The larger the VM, the higher the disk throughput limits, but these limits are not linearly correlated with the cost of the VM. In other words, you may double the size of the VM in terms of vCPUs, memory, and cost, but that doesn't mean you will get double the disk throughput.

As you can see in the above chart, the N2 series offers higher write performance when compared to the respective N1 instance type. The read performance remains the same.


Instance Series | N1 RAM | N2 RAM


N2 uses Cascade Lake CPUs with a base frequency of 2.8 GHz and a sustained all-core turbo of 3.4 GHz. Google touts this newer generation as faster and, even with its slightly higher price, a 20% price-performance improvement.

Greenplum Performance

Using the TPC-DS benchmark, I compared n1-highmem-8 with 2TB SSD disks with n2-highmem-8 also with 2TB SSD disks. The performance increase was dramatic.

Data loading improved by 17%, single user queries improved 31%, and concurrent user queries improved 31%. This was all done by simply changing the Segment nodes from N1 to N2.

Changing to N2

If you deploy a new cluster today in GCP, you will see the N2 series available in the GCP Deployment Manager interface. If you have an existing cluster in GCP, you can switch to N2 with the gpcompute utility. Note, you may need to first update your cloud utilities with gprelease.

Try it Out!

Greenplum on Azure Updates

TL;DR: It is faster.

Proxy Interconnect

Greenplum 6 now has a third Interconnect type which is “Proxy”. This new type was designed specifically for running in Azure because the software defined network in Azure is optimized for TCP traffic while Greenplum uses UDP via UDPIFC.

Note: Greenplum does support a TCP Interconnect but it does not have the optimizations that UDPIFC and Proxy have. It will likely go away with Greenplum 7 too.

The Proxy Interconnect is automatically configured when you deploy Greenplum 6 via the Azure Marketplace so you don’t have to configure anything.

The goal of the Proxy Interconnect is to reduce the number of “Network Flows” in Azure, which it does very well, and that helps performance.

Segments Per Node

With Greenplum 4 or 5 on Azure, there are only 2 primary segments and 2 data volumes per node. Adding additional segments actually hurts performance because of the network limitations in Azure.

With Greenplum 6 on Azure, which uses the Proxy Interconnect, there are 3 primary segments and 3 data volumes per node. The additional network performance gained from the Proxy Interconnect allows better utilization of the virtual machines in the cluster.

This is all configured automatically with the Azure Marketplace too.

Performance: Data Loading

Load time for 3 TB of TPC-DS data on the Standard_H8 instance type went from 41 minutes to 26 minutes. This is a 36.6% reduction in elapsed time!

Performance: 1 User Queries

Executing all 99 TPC-DS queries against 3 TB of TPC-DS data, execution time went from 78 minutes to 67 minutes. This is a 14% reduction in elapsed time!

Performance: 5 Concurrent User Queries

Executing all 99 TPC-DS queries concurrently in random order with 3TB, execution time went from 2 hours and 49 minutes to 2 hours and 25 minutes. This is a 14% reduction in elapsed time.
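As a quick sanity check, these percentages are reductions in elapsed time, computed as (old - new) / old. Using the numbers from the three tests above:

```shell
# Elapsed-time reductions for the three Azure tests (times in minutes)
awk 'BEGIN {
    printf "load:    %.1f%%\n", (41 - 26) / 41 * 100
    printf "1 user:  %.1f%%\n", (78 - 67) / 78 * 100
    printf "5 users: %.1f%%\n", (169 - 145) / 169 * 100
}'
```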

Azure Marketplace

Greenplum is available in Azure with either a Bring Your Own License (BYOL) model with a free 90 day evaluation period or billed Hourly.

This is just one aspect of running Greenplum in Azure. Check out the Release Notes and Overview docs in the Marketplace listings too. These documents cover the features in greater detail, including snapshot backups, scaling storage, scaling compute, automated maintenance, automated upgrades, etc.

Resizing Greenplum in the Cloud

In the public cloud Marketplaces, you can scale the compute and memory in your cluster independently of storage with gpcompute. This automates the process of increasing or decreasing the instance type size to change the amount of CPU and RAM for each VM.

By scaling up, you can handle higher concurrency; by scaling down, you can save money. It is a great utility for handling the Monday morning traffic spike in the database and then scaling down for the rest of the week. Or, you may have a new cluster with few users; as adoption grows, you can scale up to handle the increase in demand.

For storage, you can scale up your storage independently of compute with gpgrow. This automates the commands to increase the storage size of your cluster. For AWS and GCP, this command is an online command which means the database is up and running while you grow your storage. For Azure, the cluster is temporarily paused while it grows the storage.

For on-premise installations, you typically scale by adding more nodes and then use the gpexpand utility. This is the most practical way to expand as you just rack more nodes and then expand the cluster to take advantage of the new nodes.

In the cloud, you can provision a new cluster so quickly that gpcopy becomes the better way to add more nodes to your cluster.

Using gpcopy

I have a small cluster in AWS with 8 nodes (64 segment cores) and now have decided to double the cluster to 16 nodes (128 segment cores). For my test, I first loaded 1TB of TPC-H data into my old cluster. This is the Source cluster.

Next, I deployed the new AWS cluster with 16 nodes using the AWS Marketplace which only took 17 minutes. This is the Target cluster.

gpcopy Configuration Steps

  • Deploy new cluster in a new VPC which requires configuring VPC peering. Note: If you are deploying the cluster to the same VPC, you won’t have to do these steps.
    • Update the Route Tables so the two VPCs could route to one another.
    • Update the Security Groups on the Target cluster so the Source cluster could communicate to it. Allow all traffic on UDP and TCP from the Source Security Groups to the Target Security Groups.
  • Install gpcopy on both clusters.
  • Stop Command Center on Target cluster.
  • On Target cluster, update gpadmin and gpmon users’ passwords to match the Source cluster.
  • On the Source cluster, update the gpadmin .pgpass file to have an entry for the private IP address of the Target cluster. Remember to use port 6432. Example: <target_private_ip>:6432:*:gpadmin:VZBQCFDxzZuWm
  • Run gpcopy on the Source cluster.
  • Run analyzedb on Target cluster.
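The .pgpass step above can be sketched as follows (the IP address and password are placeholders; the entry format is host:port:database:user:password, and the file must be mode 0600 or libpq ignores it):

```shell
# Append an entry for the Target master (placeholder IP and password)
PGPASS=$HOME/.pgpass
echo '10.0.2.15:6432:*:gpadmin:secret' >> "$PGPASS"
chmod 600 "$PGPASS"   # required; libpq ignores the file otherwise
tail -1 "$PGPASS"
```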

gpcopy Command

gpcopy --full \
--source-host mdw \
--source-port 6432 \
--dest-host <target_private_ip> \
--dest-port 6432

gpcopy Results

[INFO]:-Total elapsed time: 15m10.61753093s
[INFO]:-Total transferred data 1.2TB, transfer rate 9.2TB/h
[INFO]:-Copied 2 databases
[INFO]:-Database gpperfmon: successfully copied 34 tables, skipped 0 tables, failed 0 tables
[INFO]:-Database dev: successfully copied 187 tables, skipped 0 tables, failed 0 tables
[INFO]:-Copy completed successfully


The Source cluster ran one of the TPC-H queries in 1 minute and 15 seconds; the same query on the new cluster ran in 25 seconds. As another example, a query that ran in 1 minute and 59 seconds now runs in 33 seconds.

Benefits of gpcopy

  • Validation that the new cluster performs as expected.
  • Automatic Rollback plan. If I don’t like it, I delete the new cluster and keep using the old one.
  • Fast.
  • Secure. The data transferred between VMs is routed through your private network.

Final Thoughts

The elasticity of the cloud provides new solutions to old problems. How can I scale Greenplum? The old way is to add more nodes and run gpexpand. With the cloud, there are three answers.

  1. Scale compute with gpcompute
  2. Scale storage with gpgrow
  3. Scale nodes with gpcopy

Cloud Deployments – Details Matter

Cloud deployments need special attention because of the cloud environment itself, and each cloud vendor has unique requirements. AWS is the most popular environment, so this post will discuss some of the additional configurations that are needed and available in the Greenplum on AWS Marketplace product offerings.

Accelerated Networking

To enable Accelerated Networking, AWS requires either an Intel or Elastic Network Adapter (ENA) driver installed in your VMs depending on the Instance Type you are using. Not only is the driver needed, but you must modify the VM to enable the driver.

After the driver has been installed, you then must stop the VM and execute a CLI command to enable Accelerated Networking. After the VM starts again, you should verify that the drivers have been installed.

# Intel Driver
modinfo ixgbevf

# ENA Driver
modinfo ena
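For reference, the stop-and-enable step might look like this with the AWS CLI (the instance ID below is a placeholder):

```shell
# Placeholder instance ID; Accelerated Networking can only be toggled
# while the instance is stopped.
INSTANCE=i-0123456789abcdef0
aws ec2 stop-instances --instance-ids $INSTANCE
aws ec2 wait instance-stopped --instance-ids $INSTANCE
aws ec2 modify-instance-attribute --instance-id $INSTANCE --ena-support
aws ec2 start-instances --instance-ids $INSTANCE
```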

After this is done, you can create an Amazon Machine Image (AMI) so that new VMs you create have both the drivers and Accelerated Networking.

Elastic Block Storage (EBS) Timeout

Did you know that your VM may time out when trying to communicate with a mounted EBS volume? When it does, the operating system doesn't know what to do. The disk can't be unmounted and you can't use it at all. It isn't a good situation to be in.

EBS may recover the disk properly, but in the meantime your operating system may corrupt the disk. You can try to reboot, but there is no guarantee that the files will be intact. EBS didn't technically fail either; your VM just timed out communicating with EBS.

The default timeout is 30 seconds, but it can be increased to 255 seconds. AWS recommends using the largest possible timeout to avoid this scenario.

To alter this, you have to add nvme_core.io_timeout=255 to your kernel boot options and then restart the VM. Ideally, this setting is part of your AMI so that all VMs get it automatically and you avoid an EBS timeout.
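Setting a kernel boot option is typically done through the grub config. Here is a sketch that performs the edit on a scratch copy (the real file is /etc/default/grub, and the sample contents below are made up):

```shell
# Work on a scratch copy; on a real host you would edit /etc/default/grub,
# then run: sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot
cat > /tmp/grub.demo <<'EOF'
GRUB_CMDLINE_LINUX="console=ttyS0,115200n8 crashkernel=auto"
EOF
# Append the NVMe timeout to the kernel command line
sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 nvme_core.io_timeout=255"/' /tmp/grub.demo
grep GRUB_CMDLINE_LINUX /tmp/grub.demo
```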

Disk Settings on Reboot

Did you know that when you reboot a VM in AWS, the blockdev read-ahead and disk scheduler settings get reset to the defaults? The consequence is much slower disk performance.

Greenplum on AWS Marketplace

The Greenplum on AWS Marketplace products have Accelerated Networking enabled, the EBS timeout set to the maximum of 255 seconds, and a service on startup that automatically sets the optimal disk settings.

By using the Greenplum on AWS, Azure, or GCP Marketplace products, you can focus on using the database without having to worry about the tuning and configuration that each cloud environment requires.

Backup & Restore

Backing up the Greenplum database was commonly performed by the gpcrondump utility but starting with Greenplum 5, this utility has been deprecated. You can still use it but it won’t be included or supported after Greenplum 5. Instead, you should use gpbackup to create database backups.

Greenplum 5

You can still find gpbackup in Greenplum 5, but it is not the latest version available. Instead, install the latest version of gpbackup to get the newest features and bug fixes.

Greenplum 6

The backup utility is no longer included with the database installer. This method of software delivery enables the Backup/Restore team to ship software faster than if it were bundled with the database installer.


This is so easy! Installation is done via a Greenplum package. For example:

gppkg -i pivotal_greenplum_backup_restore-1.18.2-1-gp6-rhel-x86_64.gppkg

Platform Extension Framework (PXF) – Azure Blob Store

Greenplum can read and write data to Azure blob store with PXF just like it can with AWS S3. Here is a quick demo on using Azure.

Step 1 – Start PXF

Run this as gpadmin on the Master (mdw) host.

pxf cluster start

Step 2 – Create Azure Storage Account

Log into Azure and navigate to Storage Accounts and create a new one.

Create New Storage Account

The Storage Account name must be unique within Azure. I picked “greenplumguru” for this demo but your name must be different.

Step 3 – Retrieve the Storage Account Key

After the Storage Account has been created, go to the resource and click on “Access Keys” to copy it.

Retrieve the “key”.

Step 4 – Configure PXF

On the master node (mdw), execute the following commands.

mkdir $PXF_CONF/servers/demo
cp $PXF_CONF/templates/wasbs-site.xml $PXF_CONF/servers/demo/
vi $PXF_CONF/servers/demo/wasbs-site.xml

Change YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME to the name of your Storage Account from Step 2. Change YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY to the key value from Step 3.
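The two substitutions can be scripted. This sketch runs them on a scratch copy (the property shown is an abridged stand-in for the real template, and the account key is fake):

```shell
# Abridged stand-in for $PXF_CONF/templates/wasbs-site.xml; the real
# template has more properties. Account name and key are examples only.
cat > /tmp/wasbs-site.xml <<'EOF'
<property>
    <name>fs.azure.account.key.YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME.blob.core.windows.net</name>
    <value>YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY</value>
</property>
EOF
sed -i -e 's/YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME/greenplumguru/' \
       -e 's/YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY/FaKeKeY123==/' /tmp/wasbs-site.xml
grep -E 'greenplumguru|FaKeKeY' /tmp/wasbs-site.xml
```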

Step 5 – Sync PXF

pxf cluster sync

Step 6 – Create a Container in your Storage Account

Click on Containers in your Storage Account
Create Container

Step 7 – Create Writable External Table

CREATE WRITABLE EXTERNAL TABLE ext_write
(id int, description text)
LOCATION ('pxf://demo-container@greenplumguru.blob.core.windows.net/demo-dir?PROFILE=wasbs:text&SERVER=demo')
FORMAT 'csv';

In the above example, I used the following:

  • “greenplumguru” for the Storage Account from step 2
  • “demo” as the name of the server configuration from step 4
  • “demo-container” which is the container name from step 6
  • “demo-dir” as an example directory path within the container

Step 8 – Insert Some Data

INSERT INTO ext_write SELECT i, 'foo_' || i 
FROM generate_series(1,100000) as i;

Step 9 – Create Readable External Table

CREATE EXTERNAL TABLE ext_read
(id int, description text)
LOCATION ('pxf://demo-container@greenplumguru.blob.core.windows.net/demo-dir?PROFILE=wasbs:text&SERVER=demo')
FORMAT 'csv';

Step 10 – Select from External Table

SELECT * FROM ext_read LIMIT 10;

Platform Extension Framework (PXF) – S3

Greenplum has always used External Tables to access external data for query access. You can load this data into Greenplum formatted tables or even use as partitions in a partitioned table. The “location” of the External Table can be many things such as gpfdist (for files) or S3 or PXF.

PXF can access many formats like HDFS, HBase, Hive, JDBC, Google Cloud Storage, Minio, and AWS S3. This article will cover S3 but you can learn more about PXF here.

When you deploy Greenplum in AWS, an IAM::Role resource is created which enables many AWS features in the virtual machines deployed. You will see this warning message about the Role when you deploy.

IAM::Role acknowledgment

Because you have the Role by default in AWS, you can leverage these credentials immediately to access S3 via PXF. Here is how you do it.

Step 1 – Start PXF

Run this as gpadmin on the Master (mdw) host.

pxf cluster start

Step 2 – Create S3 Bucket

Note: The bucket name must be unique.

Step 3 – Create Writable External Table

CREATE WRITABLE EXTERNAL TABLE ext_write
(id int, description text)
LOCATION ('pxf://my-demo-bucket-greenplum/?PROFILE=s3:text&SERVER=s3')
FORMAT 'csv';

Step 4 – Insert Some Data

INSERT INTO ext_write SELECT i, 'foo_' || i 
FROM generate_series(1,100000) as i;

Step 5 – Create Readable External Table

CREATE EXTERNAL TABLE ext_read
(id int, description text)
LOCATION ('pxf://my-demo-bucket-greenplum/?PROFILE=s3:text&SERVER=s3')
FORMAT 'csv';

Step 6 – Select from External Table

SELECT * FROM ext_read LIMIT 10;

As you can see, with relative ease, you can create readable and writable external tables using PXF in AWS by leveraging the existing credentials that are included with your Greenplum on AWS deployment. Enjoy!

Greenplum Scaling in the Cloud

Scaling On-Premise

MPP databases like Teradata, Redshift, and Greenplum have traditionally “scaled” by adding more physical nodes. The shared-nothing architecture is great in that it achieves near-linear scalability: if you double the number of machines, you get nearly double the performance.

For Greenplum, adding more nodes is done with the help of a tool called gpexpand. This tool redistributes the data to the newly added nodes. The database essentially has to recreate the tables to spread the data across all of the nodes in the cluster. This is a time-consuming process, but the database can stay up while the expansion occurs.

How long does it take? Well, you first have to buy more physical nodes, then rack them and run power and network. After all of that, you can finally run gpexpand, which executes fairly quickly.

Scaling in the Cloud Option 1

With the cloud, provisioning a new VM is quick and easy so you can add more nodes and then run gpexpand. However, you still have to reshuffle the data which impacts the database while it is running. It also doesn’t give you a rollback plan either.

The easiest and safest option to add more nodes in the cloud is to create a new cluster and use gpcopy to move the data from the old cluster to the new one. This gives you the ability to validate that the larger cluster performs as expected before you switch over. It also gives you a rollback plan to go back to the old cluster. Once you are happy with the new cluster, you can delete the old one.

How long does it take? In the cloud, it takes no more than an hour to create a new cluster and transfer rates with gpcopy have been observed at 5TB to 10TB per hour.

Scaling in the Cloud Option 2

The other option in the cloud is available via the AWS, Azure, and GCP Marketplaces with a tool called gpcompute. This tool alters the instance type the cluster uses for the Segment Hosts to either increase or decrease the compute power. This executes in just a few minutes too.

The additional compute gives you the ability to handle higher concurrency workloads. You can then decrease the compute to save on the IaaS costs.

With gpcompute, you can better manage your IaaS costs while dynamically handling the demands on your database.

Cloud Provider | Bring Your Own License | Billed Hourly

Untrusted Languages

Greenplum supports many different languages that can be executed in-database and in parallel. SQL and PL/pgSQL are the most common, but Perl, R, Python, and Java are also available.

Some of these languages provide a means to interact with the operating system directly, which can bypass security measures in the database. When such languages are created in the database, they are considered “Untrusted”, whereas “Trusted” languages have certain commands removed so that the language can't interact with the operating system.

Python is a very popular language and code can be written in Python using PL/PythonU in Greenplum. There isn’t a Trusted version of Python for PostgreSQL.

The “U” at the end indicates that it is in fact an Untrusted language. Again, this means a function written in this language can do nasty things to the operating system so be careful allowing this language to be used.

CREATE FUNCTION list_home_dir() RETURNS SETOF text AS $$
        import os
        return os.listdir('/home/gpadmin/')
$$ LANGUAGE plpythonu;

This simple function can return data about the operating system but nastier functions can be created to do all sorts of things as a privileged user (gpadmin).


PL/Container is a language extension in Greenplum that allows the database to execute code in Docker containers. This means you can use Untrusted languages such as Python and have that code run in a container rather than on the host operating system. This isolation makes Untrusted languages safe to use.

CREATE FUNCTION list_container_dir() RETURNS SETOF text AS $$
        # container: plc_python_shared
        import os
        return os.listdir('/')
$$ LANGUAGE plcontainer;

Running this function will list directory contents, but of the container rather than the host. This isolation protects the database cluster from being exploited through the popular Python language.

So have fun with Python in Greenplum knowing you can use PL/Container to protect the cluster from “Untrusted” code.

Disk Games

I’ve been working in the cloud for a while now and have learned a lot about optimizing disk performance. This quick blog post covers some of the things I’ve learned.


On-premise installations are typically up 24×7 but in the cloud, you may frequently want to pause your cluster to save money. Greenplum on the cloud has gppower which automates the pause and resume of your cluster so you won’t have to mess with the blockdev command.

Read Ahead

The installation guide for Greenplum advises you to set the read-ahead value to 16384 for each data volume in your cluster.

/sbin/blockdev --setra 16384 $device

Did you know that on reboot, your operating system will reset this back to the default? So, you will need to set the read-ahead after every reboot. This is done automatically for you in the cloud, but if you build your own cluster, be sure to add automation that does this after starting the nodes.
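If you manage your own hosts, one simple approach is a boot-time script. A sketch (the device names are assumptions for this example):

```shell
# Sketch for /etc/rc.d/rc.local (or a systemd oneshot unit): re-apply the
# recommended read-ahead after every boot. Device names are examples.
for device in /dev/sdb /dev/sdc /dev/sdd; do
    /sbin/blockdev --setra 16384 $device
done
```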


In your /etc/fstab file, you can define the mounts like so:

/dev/sdb /data1 xfs rw,noatime,nobarrier,nodev,inode64 0 2
/dev/sdc /data2 xfs rw,noatime,nobarrier,nodev,inode64 0 2
/dev/sdd /data3 xfs rw,noatime,nobarrier,nodev,inode64 0 2

Unfortunately, you may run into problems with this configuration. The device order may change on reboot so suddenly, the disk that is supposed to be data1 is now data2. This will prevent the database from starting properly.

Instead, you should use the UUID which can be found in /dev/disk/by-uuid/ on each host.

UUID=46c6e872-20ca-4d50-9ed9-d98c4f662437 /data1 xfs rw,noatime,nobarrier,nodev,inode64 0 2
UUID=ed8b5c09-7427-455c-b132-f601d4f0b0d2 /data2 xfs rw,noatime,nobarrier,nodev,inode64 0 2
UUID=e92a5f17-3903-4c14-8e22-ebe218ab403f /data3 xfs rw,noatime,nobarrier,nodev,inode64 0 2

By using UUIDs, the device path may change but the UUID persists. This ensures your database starts without problems after the cluster is restarted.


noatime

This mount option, set in the /etc/fstab file, tells the operating system not to update a file's access timestamp every time the file is read. Leaving it off adds considerable overhead to database use. In other words, it is faster to use noatime.


nobarrier

This mount option, available in CentOS/Red Hat 6 and 7, improves performance by not using write barriers. The overall gain depends on your disk configuration, but be advised that the disks need a battery-backed write cache to ensure no data loss.


xfs_growfs

XFS has been the recommended filesystem for over 10 years, but did you know about the handy xfs_growfs utility included with it? This nifty tool grows an existing XFS filesystem while it is mounted. It is used by the Greenplum cloud utility gpgrow to grow the disk storage for your database in the cloud, and it runs online, so you don't have to stop the database to grow your storage.


Disk Scheduler

The disk scheduler recommended for Greenplum is deadline. If you ever replace a disk, be sure to set the scheduler back to deadline. The default is likely cfq, which needs to be updated.

In the cloud, the gpsnap utility actually replaces your data volumes during a restore and will automatically set the scheduler properly but if you do a disk restore manually, be sure to set this for all disks replaced.

echo "deadline" > /sys/block/$devname/queue/scheduler

Number of Disks

Typically, when you add more disks to a node, you will get greater throughput but you will eventually reach the limit of the controller so adding more disks won’t mean more throughput.

For the cloud, things get a little bit more interesting. Each cloud vendor puts performance limits on each VM and compared to a bare metal installation, these limits are considerably lower.

So you may read that a cloud vendor's disk can provide 500MB/s of read performance and think that with 4 disks you could achieve 2000MB/s. However, that same cloud vendor may put a speed limit on the VM itself of only 500MB/s. No matter how many disks you add to your cluster, you won't get more throughput!

It is often better to use 2x to 8x as many smaller virtual machines to get more overall disk throughput, rather than using the larger instance types in the clouds. By the way, all of this work has already been figured out for the Greenplum cloud products. 🙂
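To illustrate with made-up caps: four smaller VMs, each limited to 250MB/s, out-read one large VM capped at 500MB/s.

```shell
# Hypothetical per-VM throughput caps; aggregate favors more, smaller VMs
awk 'BEGIN {
    printf "1 x large VM:  %d MB/s\n", 1 * 500
    printf "4 x small VMs: %d MB/s\n", 4 * 250
}'
```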

Greenplum on Cloud Utilities

Greenplum on AWS, GCP and Azure have so many features that make it easier to run and manage Greenplum in the cloud.


gppower

gppower is a new utility that integrates with each cloud provider to automate the pause and resume of the cluster.

The pause command will stop Greenplum, Command Center, and even PXF before it then suspends the segment virtual machines. This is designed to save you money because the IaaS providers do not charge for the compute time of stopped VMs.

The resume command reverses this by starting the VMs and then starting Greenplum, Command Center and PXF.


gpgrow

This utility also integrates with the IaaS and allows you to grow storage independently of compute. Each IaaS has different attributes in regard to this feature that are worth mentioning.

Growing the disk size is online for AWS and GCP. The database doesn’t even need to be stopped. For Azure, gpgrow will automatically stop and deallocate the VMs so that the disks can be modified. It will then start the VMs and Greenplum before expanding the disks.

gpmaintain & gpcronmaintain

This utility automates the common maintenance routines needed for Greenplum. The “cron” version makes it easier to schedule your maintenance. The configuration file is in /usr/local/greenplum-cloud/conf/.

The utility automates the execution of analyze and vacuum. It checks for bloated tables too.


This utility makes it easier to install optional components like Command Center. Just run it at the command line to see the list of optional components.


This utility automates the start, stop, pause, and resume of pgBouncer. It is intended to make it easier to use the connection pooler.

gpsnap & gpcronsnap

This has been a popular utility! It automates the backup of the data volumes in your cluster for extremely fast backups. The backups are executed in parallel and can be used for disaster recovery.

Instead of copying the database to backup files, gpsnap relies on the IaaS disk snapshots to create a consistent backup of your database.

The “cron” version of the utility makes it easier to schedule the execution of gpsnap. The configuration file is in /usr/local/greenplum-cloud/conf.

gprelease & gpcronrelease

This utility automates software upgrades. When upgrades are available, the Message of the Day (the information displayed when you ssh to the master host) is updated to let you know a new version is available.

The utility not only upgrades the database but also upgrades dependent packages like MADlib and PostGIS.


Running Greenplum in the cloud keeps getting easier because of the utilities and automation that are available. So be sure to check us out in your favorite cloud provider's marketplace!

gpbackup and gprestore with S3

These tools replace the older gpcrondump and gpdbrestore utilities. A great feature of these newer tools is the ability to use S3 directly. This post will give a quick example of how to do this.

First, create a cluster and create a table to demonstrate the backup and restore.

psql -c "create table foo as select i from generate_series(1,1000000) as i distributed by (i);"

Next, create an s3_config.yaml file. You’ll need an Access key ID from AWS as well as the Secret Access key. This file should have the following:

executablepath: $GPHOME/bin/gpbackup_s3_plugin
options:
  region: <your AWS region>
  aws_access_key_id: <your access key>
  aws_secret_access_key: <your secret access key>
  bucket: <S3 Bucket name>
  folder: <Path to store your backups in S3>

Next, you can execute a backup.

gpbackup --dbname gpadmin --plugin-config /home/gpadmin/s3_config.yaml 

Pretty simple, right?

To restore, you can follow these simple steps. (You will need to update the timestamp value for the backup you wish to restore.)

dropdb gpadmin
gprestore --plugin-config /home/gpadmin/s3_config.yaml --timestamp 20190823142028

Installing Greenplum with rpm

Pivotal Greenplum 5.x has two main installers. One is the older “binary” installer and the other is an rpm installer. With Greenplum 6.x, there will only be a single rpm installer. So, you need to get used to installing with rpm. This post will cover installing the Greenplum software on a cluster using rpm.

Step 1

First you need to install the binaries on your Master node. This is pretty simple: copy the rpm file to your Master node, connect to the Master node as gpadmin, and then install the software.

scp -i private_key greenplum-db.rpm gpadmin@master_node:/home/gpadmin/
ssh -i private_key gpadmin@master_node
sudo rpm -i greenplum-db.rpm
sudo chown -R gpadmin:gpadmin /usr/local/greenplum-db*
echo "source /usr/local/greenplum-db/greenplum_path.sh" >> /home/gpadmin/.bashrc
source /usr/local/greenplum-db/greenplum_path.sh

Step 2

Note: This step assumes you have already exchanged keys across your cluster either manually or with gpssh-exkeys.

Copy the rpm file to all other nodes in the cluster and then run the rpm -i command again. You’ll need a file that lists all nodes in the cluster except the master.

for i in $(cat all_hosts_except_master.txt); do scp greenplum-db.rpm gpadmin@$i:/home/gpadmin/; done
gpssh -f all_hosts_except_master.txt "sudo rpm -i /home/gpadmin/greenplum-db.rpm"
gpssh -f all_hosts_except_master.txt "sudo chown -R gpadmin:gpadmin /usr/local/greenplum-db*"


There are lots of ways to achieve this with automation tools, but this is one easy way to install the rpm on all nodes in your cluster.

Managing Resource with Resource Groups

You might have missed the addition of Resource Groups in Greenplum Database version 5, but it is worth a look. It is an alternative to Resource Queues for managing resources, and because of how it works, it makes the database faster!

Resource Queues prioritize and allocate resources, and they do this at the process level in the operating system.

Resource Groups, on the other hand, rely on the OS kernel (Linux cgroups) to manage and prioritize resources. This alone removes overhead, making the database faster!

Resource Groups also give you more control over your resources. So not only is the database faster, you have greater control over the cluster.

Implementing Resource Groups is easy too.

gpconfig -c gp_resource_manager -v group
gpstop -a 
gpstart -a
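Once the resource group manager is enabled, you can define groups and assign roles to them. A sketch (the group and role names and limits are examples, and this requires a running cluster):

```shell
# Requires a running Greenplum cluster; names and limits are examples only.
psql -c "CREATE RESOURCE GROUP rg_etl WITH (CONCURRENCY=10, CPU_RATE_LIMIT=20, MEMORY_LIMIT=25);"
psql -c "ALTER ROLE etl_user RESOURCE GROUP rg_etl;"
```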

More information on Resource Groups is in the Admin Guide.

Fun with dblink and Functions

Greenplum includes several optional modules from PostgreSQL. One of these is dblink which enables connecting to a remote PostgreSQL or Greenplum database with SQL.

First, run this command as gpadmin:

psql -f /usr/local/greenplum-db/share/postgresql/contrib/dblink.sql


If you are a superuser like gpadmin, you can use the connect function to reach a database in the same cluster without specifying a password, like this:

PERFORM dblink_connect('blog_test', 'dbname=blog_db');

So now you can execute some SQL in the other database. It is also in a separate transaction which gives you the ability to have an autonomous transaction inside of a Function.

v_sql := 'CREATE TABLE foo (id int) DISTRIBUTED BY (id)';
PERFORM dblink_exec('blog_test', v_sql, true);
PERFORM dblink_disconnect('blog_test');

If you do all of this as a superuser, the function will execute the dblink successfully but if you grant EXECUTE on the function to a non-superuser, it will fail. Why? Because by default, the Function executes as the user that is executing the Function, not the owner.

The easy fix for this is to use SECURITY DEFINER when creating the function. This means the Function executes as the owner of the Function. You could have all sorts of code in your Function that requires a superuser to execute and grant it to a non-superuser. This includes dblink!
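Putting the pieces together, a sketch of such a function (all names here are made up; this requires a running cluster with dblink installed):

```shell
# Requires a running Greenplum cluster with dblink installed.
# Function, connection, and role names are examples only.
psql <<'EOF'
CREATE OR REPLACE FUNCTION create_remote_foo() RETURNS void AS $$
BEGIN
    PERFORM dblink_connect('blog_test', 'dbname=blog_db');
    PERFORM dblink_exec('blog_test', 'CREATE TABLE foo (id int) DISTRIBUTED BY (id)', true);
    PERFORM dblink_disconnect('blog_test');
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

-- A non-superuser can now run it, because it executes as the function owner:
GRANT EXECUTE ON FUNCTION create_remote_foo() TO some_app_user;
EOF
```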


Have fun!

Greenplum Summit

I will be presenting at the Greenplum Summit which is held in New York City March 18-22. I’m really looking forward to the event and speaking with customers. If you attend, be sure to look for me and let me know you use this site!

  • March 18-22 in New York City
  • Use code 2019_GREENPLUM to get a 25% discount off any pass type when you Register
  • Hands On Training Pre-Conference

Greenplum Master High-Availability in the Cloud

Master Node
The Greenplum architecture uses a Master node for client connections. The Master also generates and dispatches the query plan to the Segments and then returns results back to the clients. For the most part, the Master node is relatively idle because the data is distributed across multiple segments on other hosts.

If the Master node were to fail, the database goes offline so to mitigate this risk and provide high-availability, Greenplum has a Standby-Master process that runs on a separate host. Just like the Master, the Standby-Master is usually pretty idle. Sometimes it is also used as an ETL node so you aren’t wasting money by having an extra, idle machine.

In a physical data center, a node replacement may take days so having a physical standby-master is typically a requirement. This isn’t a fault with Greenplum but a fault of the realities in replacing physical hardware.

Deployments in the Cloud, however, can replace a bad node in minutes. So do you really need a VM just for the Standby-Master process? Furthermore, what happens when you automate the node replacement so that the cluster “Self-Heals”?

Cloud Marketplaces
In the AWS, Azure, and GCP Marketplaces, Pivotal Greenplum is deployed with a single Master node and the Standby-Master process runs on the first Segment host. If any node in the cluster fails, the cluster Self-Heals. The process is different in AWS versus GCP and Azure but the concept is the same. A new VM is created and the commands to recover the cluster back to 100% are executed automatically.

You get to save money by not having a dedicated Standby-Master node and you still get the high-availability you want with Greenplum.

gpsnap Video

gpsnap is a utility currently available for Greenplum in AWS and Azure. It automates backup and recovery with AWS EBS snapshots and Azure Disk snapshots. For AWS, you can also copy these snapshots to a different AWS Region which is an ideal solution for Disaster Recovery.

Check it out!

Monitoring Sessions

I recently found out that one of my assumptions in Greenplum was completely wrong! After all of these years, I didn’t understand how pg_stat_activity worked!

I always thought that each row in this system table equated to a single session so, if I wanted to get the number of sessions logged in, I could do this:

select count(*) from pg_stat_activity;

But, this is wrong! One row actually means one server process and not one session.

When a session submits a query, pg_stat_activity may report multiple rows for that single session as it executes the query. Why? A single query can spawn multiple backend processes, and these backend processes show up in pg_stat_activity.

Example – Query 22 from TPC-DS
More Information on TPC-DS

select i_product_name
,i_brand
,i_class
,i_category
,avg(inv_quantity_on_hand) qoh
from inventory
,date_dim
,item
where inv_date_sk=d_date_sk
and inv_item_sk=i_item_sk
and d_month_seq between 1212 and 1212 + 11
group by rollup(i_product_name
,i_brand
,i_class
,i_category)
order by qoh, i_product_name, i_brand, i_class, i_category
limit 100;

First, look at pg_stat_activity and see that I have two sessions open. One is executing the simple query from pg_stat_activity (sess_id=1013) and the other is idle (sess_id=1011).

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | <idle>
(2 rows)

Now, execute Query 22 from sess_id=1011.

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | select i_product_nam
   31137 |    1011 | select i_product_nam
(3 rows)

I now see that sess_id=1011 has started a second process to complete the query. It is still a single database session but pg_stat_activity now shows two rows for this session.

Even after the query is over, I can still see the idle process created for sess_id=1011.

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | <idle>
   31137 |    1011 | <idle>
(3 rows)

So how do you get the number of sessions from pg_stat_activity? With this query:

gpadmin=# select count(*) from (
gpadmin(#     select sess_id from pg_stat_activity group by sess_id
gpadmin(# ) as sub;
 count 
-------
     2
(1 row)
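
Equivalently, you can get the same number in one step with count(distinct):

select count(distinct sess_id) from pg_stat_activity;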

Default Storage Options

When you create a table in Greenplum, the default storage options make the table heap (appendonly=false) which also means the table is row oriented and not compressed. But, you can change the default with the GUC gp_default_storage_options.

Here are the defaults:

gpadmin=# show gp_default_storage_options;

And a quick test, shows how you can default to something else.

gpadmin=# set gp_default_storage_options='appendonly=true';
SET
gpadmin=# create table foo as select i from generate_series(1, 1000) as i distributed by (i);
gpadmin=# \d foo
Append-Only Table "public.foo"
 Column |  Type   | Modifiers 
--------+---------+-----------
 i      | integer | 
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: t
Distributed by: (i)

You can make the changes permanent with gpconfig.

gpconfig -c 'gp_default_storage_options' -v 'appendonly=true, orientation=column'

And you can make it limited to a single database.

ALTER DATABASE mytest SET gp_default_storage_options = 'appendonly=true, orientation=column';

It’s a Snap!

gpsnap and gpcronsnap are new utilities for Pivotal Greenplum on AWS. These utilities leverage Amazon EBS Snapshots to take very quick snapshot backups.

EBS Snapshot
Elastic Block Storage (EBS) is the preferred storage in Amazon AWS for many reasons like persistence, performance, and high availability. A great feature of EBS is the ability to take a “Snapshot”. A snapshot can be taken of a disk volume and to get a consistent backup, you need to stop the database, take a snapshot of every volume, and then restart the database.

EBS Restore
The restore process requires stopping the database, detaching and deleting the existing volumes, creating new volumes from the snapshots, attaching the volumes to the right nodes, mounting the new volumes to the right directories, and then starting the database. If you don’t delete the existing volumes and just create new ones, you will incur more storage cost in AWS and you may hit your disk quota. So take a snapshot before you execute a restore so you can revert back if needed.

gpsnap makes this entire process easy. It does all of the tasks needed to take a consistent snapshot of the database. It also automates the tasks for restoring from a snapshot.

The commands include:
– gpsnap list: lists snapshots
– gpsnap create: creates a new snapshot
– gpsnap delete <snapshot_id>: deletes a specific snapshot
– gpsnap delete all: deletes all snapshots
– gpsnap restore <snapshot_id>: restores a specific snapshot

gpcronsnap is the second utility and you can probably guess from the name that it is the scheduling portion of the utility. By default, a snapshot backup will be created weekly and retain up to four snapshots. This is all configurable in a configuration file too.

So you can see, it is now a snap for Pivotal Greenplum on AWS to take snapshot backups!

You can either Bring Your Own License (BYOL) or pay Hourly for Greenplum on AWS. Here are the links:
Pivotal Greenplum on AWS (BYOL)
Pivotal Greenplum on AWS (Hourly)

Pivotal Greenplum Version 5.x

Version 3.x
I’ve been using Greenplum since version 3.0 back in 2007. That version was actually the first commercially available version too. The versions were released about once every 6 to 12 months and got as high as 3.3.

Version 4.x
Once the version got to 4.3, the number just seemed to get stuck. 4.3.0 was a pretty big release which changed “Append Only” tables to be “Append Optimized”, which simply meant you could start updating and deleting rows in tables stored in the append-only format. More enhancements kept coming to 4.3, but the version number never exceeded 4.3.

Labs Methodology
Major enhancements came to version 4.3.x and the innovation came at a faster pace but you may not have noticed this if you were just looking at the version number. Pivotal’s engineering team embraced the Pivotal Labs methodologies of pair programming and quick iterations. There was a huge transformation happening in this group.

Version 5.x
When 5.0 came out, it was a big deal. Greenplum had always been a fork of PostgreSQL 8.2, but with 5.0, Greenplum was rebased to 8.3. A lot of work went into this change, which also required migrating from 4.3.x to 5.0.

Now this next change, I didn’t expect. During 4.3.x development, the version number would only change to 4.4 or 5.0 when the database required a migration to upgrade. With 5.x, the version numbers are coming fast and don’t require migrations to upgrade. It is just a simple binary swap to upgrade.

The result has been the release of 5.0 on 2017-09-14, 5.1 on 2017-10-20, and 5.2 on 2017-11-18. Do you see the pattern? Monthly point releases! All of these releases so far have been simple binary upgrades but have a ton of improvements each time.

Version 6.0?
Version 6 is in sight. Seriously. I would expect this to be the next rebase of Greenplum to PostgreSQL 8.4 and will require a database migration to upgrade. It is amazing how the Labs culture and open source software development has enhanced the development of Greenplum.

AWS Marketplace

I haven’t posted much here lately because I’ve been working on deploying GemFire and Greenplum in the AWS Marketplace. It has been interesting work so I thought I would make a post about it here.

Amazon Web Services (AWS)
Amazon has the most popular cloud in the US with many products available which now includes Pivotal Greenplum and Pivotal GemFire. Both products are available as Bring Your Own License (BYOL) or Billed Hourly. BYOL simply means you already own a license and you wish to apply this to a deployment in AWS.

If you are not familiar with AWS, there are many ways to configure resources which can be overwhelming to learn and optimize. I have done extensive testing in these product offerings to ensure stability, security, repeatability, and performance. With just a few clicks, you can have an on-demand cluster built that rivals a full rack of on-premise hardware.

Greenplum and GemFire on AWS
Greenplum Billed Hourly
Greenplum Bring Your Own License (BYOL)
GemFire Billed Hourly
GemFire BYOL

Dropping Corrupt Tables in Greenplum

Greenplum is a Massively Parallel Processing database, which means the data is distributed across multiple segments on multiple hosts. Each segment process stores a catalog of objects, and it is possible for this to get out of sync because of hardware failures or killed processes. For example, if you are vacuuming a table and then decide to kill the process on the Master host, the segments may not get the signal properly and roll back the transaction. This can happen when using “kill -9” rather than just “kill” to stop a process.

You may get this error message if a table is corrupt:

gpadmin=# select * from corrupt;
ERROR:  relation with OID 109802 does not exist  (seg0 slice1 gpdbsne:40000 pid=78502)

So you decide to drop the table and start over but that doesn’t work either.

gpadmin=# drop table corrupt;
ERROR:  table "corrupt" does not exist  (seg0 gpdbsne:40000 pid=78502)

The easy fix for this is to use “if exists” in the drop command.

gpadmin=# drop table if exists corrupt;
Time: 10.423 ms

Now the table is gone and you can recreate it. Next time, don’t use kill -9 and don’t kill vacuum commands.

More Details
If you are wondering how I managed to corrupt this table, here are the steps.
1. Create the table

gpadmin=# create table corrupt (id int, foo text, bar text) distributed by (id);
Time: 31.195 ms

2. Insert some data

insert into corrupt select i as id, 'foo_' || i, 'bar_' || i from generate_series(1,100) as i;
INSERT 0 100
Time: 46.419 ms
gpadmin=# analyze corrupt;
Time: 108.193 ms
gpadmin=# \q

3. Connect to a segment directly. I’m using a single node for this test.

[gpadmin@gpdbsne ~]$ ps -ef | grep postgres | grep M
gpadmin   78213      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db- -D /data/primary/gpseg1 -p 40001 -b 3 -z 2 --silent-mode=true -i -M mirrorless -C 1
gpadmin   78214      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db- -D /data/primary/gpseg0 -p 40000 -b 2 -z 2 --silent-mode=true -i -M mirrorless -C 0
gpadmin   78239      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db- -D /data/master/gpseg-1 -p 5432 -b 1 -z 2 --silent-mode=true -i -M master -C -1 -x 0 -E

I see that the segments are using ports 40,000 and 40,001 while the master is using 5432. I’ll connect to seg0 on port 40,000 and then drop the table. This only drops the object in this one segment. I’ll also do a quick query to make sure I only see half of the data because I have 2 segments.

PGOPTIONS='-c gp_session_role=utility' psql -p 40000
psql (8.2.15)
Type "help" for help.

gpadmin=# select count(*) from corrupt;
 count 
-------
    50
(1 row)

Time: 3.485 ms

4. Corrupt the database by dropping the table only on one segment.

gpadmin=# drop table corrupt;
Time: 9.410 ms
gpadmin=# \q

5. Confirm that the table is corrupt.

[gpadmin@gpdbsne ~]$ psql
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from corrupt;
ERROR:  relation with OID 109802 does not exist  (seg0 slice1 gpdbsne:40000 pid=78502)

And again, the fix:

gpadmin=# drop table if exists corrupt;
Time: 10.423 ms

S3 External Tables

S3 is Amazon’s Simple Storage Service which is an inexpensive cloud storage solution and has quickly become a solution for cold data and backups. Greenplum now has External Tables that can read and write data to S3 so you can leverage this popular storage service with Greenplum. Here is how you do it!

1. You will need an Amazon account with your Access Key ID and Secret Access Key. If you have the aws CLI installed and configured, just cat ~/.aws/credentials

2. Get your default region. This is in ~/.aws/config

3. You’ll need a bucket in this region and this can be done with the AWS web interface.

4. You’ll need a configuration file. Here is an example (be sure to change the secret and accessid).

secret = <secret>
accessid = <accessid>
threadnum = 4
chunksize = 67108864
low_speed_limit = 10240
low_speed_time = 60
encryption = true

5. Copy the configuration to every Segment Host in your cluster.

for i in $(cat segment_hosts.txt); do scp s3_demo.conf $i:/home/gpadmin; done

Writing to S3

1. Create the Writable External Table with the S3 protocol, AWS URL that has the correct region, and the configuration file that is found on every Segment Host.

CREATE WRITABLE EXTERNAL TABLE public.demo_write
(id int, fname text, lname text)
LOCATION ('s3:// config=/home/gpadmin/s3_demo.conf')
FORMAT 'TEXT';

2. Execute an INSERT statement:

INSERT INTO demo_write 
SELECT i, 'Jon_' || i, 'Roberts_' || i
FROM generate_series(1,10000) as i;

Note: Each Segment will create a file in S3 in your bucket with the prefix you specify in the location specified in the Writable External Table. In this demo, each file is prefixed with “demo”. An example filename is “demo9767abbb3.txt”.

Reading from S3
1. Create the External Table with the same location and configuration as before.

CREATE EXTERNAL TABLE public.demo_read
(id int, fname text, lname text)
LOCATION ('s3:// config=/home/gpadmin/s3_demo.conf')
FORMAT 'TEXT';

2. Select the data.

gpadmin=# select * from demo_read limit 10;
 id  |  fname  |    lname    
-----+---------+-------------
  58 | Jon_58  | Roberts_58
  90 | Jon_90  | Roberts_90
 122 | Jon_122 | Roberts_122
 191 | Jon_191 | Roberts_191
 207 | Jon_207 | Roberts_207
 239 | Jon_239 | Roberts_239
 271 | Jon_271 | Roberts_271
 319 | Jon_319 | Roberts_319
 335 | Jon_335 | Roberts_335
 351 | Jon_351 | Roberts_351
(10 rows)

Time: 1139.538 ms

A few notes on S3 External Tables:
1. An S3 External Table that references a single file will only use a single Segment to read the data. Instead, try to have at least one file per Segment for an S3 External Table.
2. S3 External Tables support gzip compression only.
3. Use S3 External Tables for cold storage or to create a backup of a table or query using a Writable External Table.

Greenplum Append Optimized Tables

Greenplum has two major types of table storage techniques. The first is inherited from PostgreSQL, so it uses Multi-Version Concurrency Control (MVCC) and is referred to as HEAP storage. The second is a more efficient technique called Append Optimized (AO). This blog post discusses the performance benefits of AO over HEAP tables.

HEAP is the default storage technique when you create a table in Greenplum. It handles UPDATE and DELETE commands just like PostgreSQL by marking the old row invalid and inserting a new record into the table. This hides the deleted or stale row from new queries but provides read consistency to sessions that may have started querying the table prior to the UPDATE or DELETE.

Here is an example of a HEAP table.

CREATE TABLE heap_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
DISTRIBUTED BY (id);

But how does this HEAP table store data? How much space does it use?

INSERT INTO heap_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE heap_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'heap_table';


So it uses 1,114,112 bytes in my small single node test for this generated data of 10,000 records.

AO storage was originally designed to be Append Only but starting with Greenplum version 4.3, these tables became Append Optimized because the tables now allow UPDATE and INSERT like a HEAP table but retain the efficient storage.

Create the same table and data but as an AO table.

CREATE TABLE ao_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
WITH (appendonly=true)
DISTRIBUTED BY (id);

INSERT INTO ao_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE ao_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'ao_table';


This is now only using 973,016 bytes or a 12.6% reduction in disk space being used. For large tables, the performance difference will be more noticeable because it will take less time to scan your disks.

AO tables can also be compressed, which is not possible with HEAP tables. What does that look like?

CREATE TABLE ao_compressed_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
WITH (appendonly=true, compresstype=quicklz)
DISTRIBUTED BY (id);

INSERT INTO ao_compressed_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE ao_compressed_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'ao_compressed_table';


The size dropped again to 234,344 bytes, a huge decrease in disk space and another 75.9% reduction! This trades CPU cycles for IO scans, which in most cases will provide better query times.

In the last test, make the table also column oriented. This is also a feature only available for AO tables.

CREATE TABLE ao_co_compressed_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
WITH (appendonly=true, compresstype=quicklz, orientation=column)
DISTRIBUTED BY (id);

INSERT INTO ao_co_compressed_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE ao_co_compressed_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'ao_co_compressed_table';


It dropped again! This time to 196,840 bytes or another 16%.

The total percentage difference between a HEAP table and the AO table that is column oriented and compressed is 82.33%!

Use AO tables. These are faster than HEAP tables and also let you compress and column orient your larger tables for even better performance.

Pivotal HDB 2.0.1 Configuration Tips

Here are some tips for configuring Pivotal HDB (based on Apache HAWQ) with Ambari.

Temp Directories
A Hadoop cluster is typically configured with JBOD, so utilize all data disks for temp space.

Here is an example of the “HAWQ Master Temp Directories” entry when the Master and Standby nodes each have 8 disks:


Here is an example of the “HAWQ Segment Temp Directories” entry when each Data Node has 8 disks:
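
As an illustration, if the eight data disks were mounted as /data1 through /data8 (hypothetical mount points; substitute your actual mounts), the comma-separated entry would look like this:

/data1/hawq/tmp,/data2/hawq/tmp,/data3/hawq/tmp,/data4/hawq/tmp,/data5/hawq/tmp,/data6/hawq/tmp,/data7/hawq/tmp,/data8/hawq/tmp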


VM Overcommit
Set the kernel parameter vm.overcommit_memory to 2.

VM Overcommit Ratio
2GB – 64GB: set the Overcommit Ratio to 50
>= 64GB of RAM: set the Overcommit Ratio to 100
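
These two settings map to kernel parameters in /etc/sysctl.conf. For example, on a host with 64GB or more of RAM (apply with sysctl -p):

vm.overcommit_memory = 2
vm.overcommit_ratio = 100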

Swap Space
2GB – 8GB: set swap space equal to RAM
8GB – 64GB: set swap space to 0.5 * RAM
>= 64GB: set swap space to 4GB

Segment Memory Usage Limit
Step 1: Calculate total memory (RAM * overcommit_ratio_percentage + SWAP)
Step 2: Calculate total memory used by other activities (2GB for OS, 2GB for Data Node, 2GB for Node Manager, 1GB for PXF)
Step 3: Subtract other memory from total memory to get the value for the Segment Memory Usage Limit

Example 1:
RAM: 256GB
Other: 7GB
Overcommit Ratio: 100

Using Yarn: ((256 * 1) + 4) – 7 = 253
Using Default Resource Manager: (256 * 1) – 7 = 249

Example 2:
RAM: 64GB
Other: 7GB
Overcommit Ratio: 50

Using Yarn: ((64 * 0.5) + 32) – 7 = 57
Using Default Resource Manager: (64 * 0.5) – 7 = 25

HDFS core-site.xml

ipc.client.connect.timeout = 300000


ipc.client.connection.maxidletime = 3600000

Optional HAWQ hawq-site.xml

hawq_rm_stmt_vseg_memory = 1gb 

By default, this is set to 128mb which is great for a high level of concurrency. If you need to utilize more memory in the cluster for each query, you can increase this value considerably. Here are the acceptable values:

128mb, 256mb, 512mb, 1gb, 2gb, 4gb, 8gb, 16gb

Alternatively, you can set this at the session level instead of the entire database.
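
For example, to give a single memory-hungry query more memory for just the current session and then revert to the default:

SET hawq_rm_stmt_vseg_memory = '4gb';
-- run the large query here, then revert
RESET hawq_rm_stmt_vseg_memory;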

Operating System gpadmin account
Log into the Master and Standby nodes and execute the following:

echo "source /usr/local/hawq/" >> ~/.bashrc

Now set the database password. Below, I am using ‘password’ as the password so set this based on your organization’s password policy. By default, gpadmin doesn’t have a password set at all.

psql -c "alter user gpadmin password 'password'"

Enable encrypted password authentication. This assumes you are using the default /data/hawq/master path. Adjust if needed. This allows you to connect to the database remotely with an encrypted password.

echo "host all all 0.0.0.0/0 md5" >> /data/hawq/master/pg_hba.conf
hawq stop cluster -u -a

Outsourcer 5.2.0 and Pivotal HDB 2.0.1

Pivotal HDB 2.0.1 (based on Apache HAWQ)
The newest release of Pivotal HDB is fantastic! It adds new features and resolves some issues as well. Here are the release notes.

One resolved issue that affected Outsourcer is a JDBC problem documented here: HAWQ-738. Pivotal HDB 2.0.0 was released in May, and I found that Outsourcer suffered from stability issues under load and was generally slower navigating in the UI than HDB 1.3 or Greenplum Database. The issue was quickly resolved in that same month, but the fix wasn’t publicly available until October of 2016 with this new release.

Quicklz is a compression library that uses GPL licensing. It is bundled with Greenplum Database and Pivotal HDB 1.3 and 2.0.0. Starting with Pivotal HDB 2.0.1, Quicklz has been removed. Why? Because of that GPL license and HAWQ is an Apache project.

In HDB 1.3 and 2.0.0, the guidance was to use Quicklz compression for row oriented tables but starting with 2.0.1, you should use Snappy compression. It is an easy change too:

CREATE TABLE mytable (id INT, fname TEXT) 
WITH (appendonly=true, orientation=row, compresstype=snappy)
DISTRIBUTED BY (id);

Note: When upgrading to 2.0.1, be sure to first change your Quicklz compressed tables to either Zlib, Parquet with Snappy, or no compression at all. Then upgrade and then you can change back to row orientation and use Snappy compression. More details are in this section of the release notes.

Future Releases
Future releases of Pivotal HDB should come quicker because the licensing hurdle is now complete. This means developers can focus on enhancements and fixes rather than licensing.

Outsourcer 5.2.0
This new release officially supports Pivotal HDB 2.0.1. It changes the compression for row oriented tables to use Snappy instead of Quicklz. If you are using Pivotal HDB 2.0.0 or Greenplum Database, Outsourcer will still use Quicklz compression.

Please consider upgrading from Pivotal HDB 2.0.0 to 2.0.1, especially if you are using Outsourcer. When I test Outsourcer with Pivotal HDB 2.0, I use build 22425 rather than build 22126, which is what you can download from Pivotal’s site. 22126 has the JDBC bug while 22425 and newer builds do not. And when you upgrade to 2.0.1, also upgrade to Outsourcer 5.2.0.

Download 5.2.0!
Source Code

New Project: gpresize

I was working with a customer recently that wanted to try more segments per host than what was originally configured in their cluster, without adding more hosts. This is possible with gpexpand, but only if mirroring has not been enabled. Disabling mirroring, however, is not a supported feature of Greenplum, but it is possible. So, to facilitate this analysis, I created “gpresize”.

The utility:
– Removes mirrors (unsupported feature in Greenplum)
– Backs up the database with gpcrondump
– Expands the cluster using gpexpand
– Shrinks the database by reinitializing the database and restoring from backup
– Adds mirroring back using gpaddmirrors and gpinitstandby

Please, don’t use this in a Production cluster. I’m using unsupported commands to do this work and it is intended for evaluation purposes only. It is intended to help you easily increase the number of segments per host and then revert back if needed.


Hive SQL Language Support

Hive is probably the most popular SQL engine for Hadoop, but not because it is fast or because it supports the SQL used by most organizations. It is just the oldest. Or is it?

Pivotal HDB, powered by Apache HAWQ (incubating), is much faster and has much better SQL language support. Pivotal HDB gets its roots from Greenplum database which has been around longer so the SQL language support is much more mature.

In all of the performance tests I’ve done comparing Pivotal HDB with Apache Hive, the results are dramatic. Pivotal HDB loads data significantly faster and executes queries significantly faster too. And yes, I’ve tested it with Tez and even with LLAP. Hive doesn’t come close to the performance of Pivotal HDB. More will be published on this subject soon.

SQL Language Support
The real reason for this post is the lack of SQL language support that Hive has. This makes it more difficult to write SQL in this environment and it makes it more difficult to migrate from a legacy RDBMS to Hadoop. Here are some examples.

Example 1: Subqueries in the WHERE Clause
This is query 6 from the TPC-DS benchmark:

select  a.ca_state state, count(*) cnt
 from customer_address a
     ,customer c
     ,store_sales s
     ,date_dim d
     ,item i
 where       a.ca_address_sk = c.c_current_addr_sk
 	and c.c_customer_sk = s.ss_customer_sk
 	and s.ss_sold_date_sk = d.d_date_sk
 	and s.ss_item_sk = i.i_item_sk
 	and d.d_month_seq = 
 	     (select distinct (d_month_seq)
 	      from date_dim
               where d_year = 1998
 	        and d_moy = 2 )
 	and i.i_current_price > 1.2 * 
             (select avg(j.i_current_price) 
 	     from item j 
 	     where j.i_category = i.i_category)
 group by a.ca_state
 having count(*) >= 10
 order by cnt 
 limit 100;

This query won’t work in Hive because of the two subqueries in the WHERE clause. You will have to rewrite this query like this:

select  a.ca_state state, count(*) cnt
 from customer_address a
     ,customer c
     ,store_sales s
     ,date_dim d
     ,item i
     , (select distinct (d_month_seq) as d_month_seq
 	      from date_dim
               where d_year = 1998
 	        and d_moy = 2 ) as sq1
     , (select j.i_category, avg(j.i_current_price) as avg_i_current_price
 	     from item j 
        group by j.i_category) as sq2
 where       a.ca_address_sk = c.c_current_addr_sk
 	and c.c_customer_sk = s.ss_customer_sk
 	and s.ss_sold_date_sk = d.d_date_sk
 	and s.ss_item_sk = i.i_item_sk
 	and d.d_month_seq = sq1.d_month_seq
        and sq2.i_category = i.i_category
 	and i.i_current_price > 1.2 * sq2.avg_i_current_price
 group by a.ca_state
 having count(*) >= 10
 order by cnt 
 limit 100;

It took me a few minutes to rewrite this and I’m still not 100% sure it is right. In a conversion effort, I would have to execute it in the legacy RDBMS and see if the results are the same as in Hive. If the legacy RDBMS is changing, then I would have a very difficult time with just a single query.

If I was doing this for benchmarking purposes, I would have to execute TPC-DS in another database that does support this query with the same data to validate the results. Clearly, the lack of SQL language support is a problem.

Example 2: Subqueries in the SELECT Clause
This is query 9 from the TPC-DS benchmark:

select case when (select count(*) 
                  from store_sales 
                  where ss_quantity between 1 and 20) > 31003
            then (select avg(ss_ext_list_price) 
                  from store_sales 
                  where ss_quantity between 1 and 20) 
            else (select avg(ss_net_paid_inc_tax)
                  from store_sales
                  where ss_quantity between 1 and 20) end bucket1 ,
       case when (select count(*)
                  from store_sales
                  where ss_quantity between 21 and 40) > 24212
            then (select avg(ss_ext_list_price)
                  from store_sales
                  where ss_quantity between 21 and 40) 
            else (select avg(ss_net_paid_inc_tax)
                  from store_sales
                  where ss_quantity between 21 and 40) end bucket2,
       case when (select count(*)
                  from store_sales
                  where ss_quantity between 41 and 60) > 28398
            then (select avg(ss_ext_list_price)
                  from store_sales
                  where ss_quantity between 41 and 60)
            else (select avg(ss_net_paid_inc_tax)
                  from store_sales
                  where ss_quantity between 41 and 60) end bucket3,
       case when (select count(*)
                  from store_sales
                  where ss_quantity between 61 and 80) > 21646
            then (select avg(ss_ext_list_price)
                  from store_sales
                  where ss_quantity between 61 and 80)
            else (select avg(ss_net_paid_inc_tax)
                  from store_sales
                  where ss_quantity between 61 and 80) end bucket4,
       case when (select count(*)
                  from store_sales
                  where ss_quantity between 81 and 100) > 4078
            then (select avg(ss_ext_list_price)
                  from store_sales
                  where ss_quantity between 81 and 100)
            else (select avg(ss_net_paid_inc_tax)
                  from store_sales
                  where ss_quantity between 81 and 100) end bucket5
from reason
where r_reason_sk = 1

You get this error:

FAILED: ParseException line 2:18 cannot recognize input near '(' 'select' 'count' in expression specification (state=42000,code=40000)

The fix is to move all of those subqueries from the SELECT to the FROM section of the query. Or you can write an elaborate CASE statement. Again, this takes time to rewrite and even more time to validate.
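As a hedged sketch of that rewrite (not the validated benchmark rewrite; aliases are illustrative), each group of scalar subqueries becomes a one-row derived table in the FROM clause that is cross joined in:

```sql
-- Sketch only: one bucket shown; repeat the pattern for buckets 2-5.
select case when b1.cnt > 74129
            then b1.avg_list
            else b1.avg_paid end bucket1
from reason
cross join (select count(*) cnt,
                   avg(ss_ext_list_price) avg_list,
                   avg(ss_net_paid_inc_tax) avg_paid
            from store_sales
            where ss_quantity between 1 and 20) b1
where r_reason_sk = 1;
```

A side benefit of this form is that the three subqueries per bucket collapse into a single scan of store_sales per quantity range.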

Example 3: Correlated Subqueries
This is query 10 from the TPC-DS benchmark:

  count(*) cnt1,
  count(*) cnt2,
  count(*) cnt3,
  count(*) cnt4,
  count(*) cnt5,
  count(*) cnt6
  from customer c,customer_address ca,customer_demographics
  where c.c_current_addr_sk = ca.ca_address_sk and
  ca_county in ('Clinton County','Platte County','Franklin County','Louisa County','Harmon County') and
  cd_demo_sk = c.c_current_cdemo_sk and 
  exists (select *
          from store_sales,date_dim
          where c.c_customer_sk = ss_customer_sk and
                ss_sold_date_sk = d_date_sk and
                d_year = 2002 and
                d_moy between 3 and 3+3) and
   (exists (select *
            from web_sales,date_dim
            where c.c_customer_sk = ws_bill_customer_sk and
                  ws_sold_date_sk = d_date_sk and
                  d_year = 2002 and
                  d_moy between 3 and 3+3) or 
    exists (select * 
            from catalog_sales,date_dim
            where c.c_customer_sk = cs_ship_customer_sk and
                  cs_sold_date_sk = d_date_sk and
                  d_year = 2002 and
                  d_moy between 3 and 3+3))
 group by cd_gender,
 order by cd_gender,
limit 100;

This query correlates data between customer and the tables in each subquery. Rewriting this can be tricky because you need to make sure you don’t duplicate data when joining to the subqueries.

Example 4: INTERSECT
INTERSECT is a SQL set operation, similar to UNION and EXCEPT. It returns the distinct rows that appear in the results of both queries.
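A minimal example of the semantics (PostgreSQL-style syntax, which Greenplum and HAWQ support):

```sql
-- Rows {1,2,3} intersected with {2,3,4} yields the distinct common rows: 2 and 3.
select x from (values (1), (2), (3)) a(x)
intersect
select x from (values (2), (3), (4)) b(x);
```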

This is query 8 from the TPC-DS benchmark:

select  s_store_name
 from store_sales,
      date_dim,
      store,
     (select ca_zip
      from (
      (SELECT substr(ca_zip,1,5) ca_zip
       FROM customer_address
       WHERE substr(ca_zip,1,5) IN (...))
      intersect
      (select ca_zip
       from (SELECT substr(ca_zip,1,5) ca_zip,count(*) cnt
             FROM customer_address, customer
             WHERE ca_address_sk = c_current_addr_sk
             group by ca_zip
             having count(*) > 10)A1))A2) V1
 where ss_store_sk = s_store_sk
  and ss_sold_date_sk = d_date_sk
  and d_qoy = 1 and d_year = 2002
  and (substr(s_zip,1,2) = substr(V1.ca_zip,1,2))
 group by s_store_name
 order by s_store_name
 limit 100;

The fix for this is to change the INTERSECT to UNION and make this another subquery that also does a GROUP BY so you have a distinct list of zip codes.
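One way to sketch that rewrite (a general pattern rather than the exact benchmark rewrite; zips_a and zips_b stand in for the two subqueries in query 8): deduplicate each branch, UNION ALL them, and keep only the zip codes that appear in both branches.

```sql
-- Hedged sketch: a zip is in the intersection if it shows up in both
-- deduplicated branches, i.e. exactly twice in the combined set.
select ca_zip
from (select distinct ca_zip from zips_a
      union all
      select distinct ca_zip from zips_b) u
group by ca_zip
having count(*) = 2;
```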

Of the 99 TPC-DS queries, 19 contain SQL that is not supported by Hive. Pivotal HDB can execute all 99 TPC-DS queries without modification. So not only is it much faster, it is also easier to migrate from a legacy RDBMS to Hadoop and easier to start using because the SQL language support is so robust.

GPLink 0.1.30 Supports Hive!

GPLink has been tested successfully with Hive. This is most useful in the Greenplum environment where you want to query a Hive table. HAWQ has the PXF protocol which supports Hive and is very performant. Please use PXF in HAWQ to access Hive. But with Greenplum, gplink now supports Hive!

The setup is covered in the gplink README file, but the configuration is a bit more involved than for a typical JDBC driver. This is because of how Hive is configured for logging. Plus, this configuration has changed across Hive versions, making it more complex to configure.

You will need the following jar files from your Hadoop cluster.

I tested with a Hortonworks cluster with Hive

Older versions of Hive may have duplicate SLF4J bindings and fail to work
properly. The error message is, “Class path contains multiple SLF4J bindings”. If
you get this, remove the slf4j-log4j12.jar file from the jar/ directory, source the file, and then try again.

Bug Fixes
The extraProps value for external connections wasn’t being parsed correctly. The extraProps is most useful for Oracle in setting the fetch size. If you are using Oracle as a source with gplink, be sure to upgrade to this new version and change your connection from “extraProperies” or “extraProperties” to “extraProps”.

Download 0.1.30 here.

GPLink 0.1.26

GPLink has a new version with the only change being the name of a script variable. The variable was renamed because it is shared between this project and another one I work on (TPC-DS). This common variable name would cause issues with the TPC-DS scripts when logging activities.


Loading Data into HAWQ

Loading data into the database is required before you can start using it, but how? There are several approaches to this basic requirement, each achieving the result in a different way. This lets you pick the loading method that best matches your use case.

Table Setup
This table will be used for the testing in HAWQ. I have this table created in a single node VM running Hortonworks HDP with HAWQ 2.0 installed. I’m using the default Resource Manager too.

CREATE TABLE test_data
(id int,
 fname text,
 lname text);

Singleton INSERTs
Let’s start with probably the worst way first. Sometimes this approach is ideal because you have very little data to load, but in most cases, avoid singleton inserts. This approach inserts just a single tuple in a single transaction.

head si_test_data.sql
insert into test_data (id, fname, lname) values (1, 'jon_00001', 'roberts_00001');
insert into test_data (id, fname, lname) values (2, 'jon_00002', 'roberts_00002');
insert into test_data (id, fname, lname) values (3, 'jon_00003', 'roberts_00003');
insert into test_data (id, fname, lname) values (4, 'jon_00004', 'roberts_00004');
insert into test_data (id, fname, lname) values (5, 'jon_00005', 'roberts_00005');
insert into test_data (id, fname, lname) values (6, 'jon_00006', 'roberts_00006');
insert into test_data (id, fname, lname) values (7, 'jon_00007', 'roberts_00007');
insert into test_data (id, fname, lname) values (8, 'jon_00008', 'roberts_00008');
insert into test_data (id, fname, lname) values (9, 'jon_00009', 'roberts_00009');
insert into test_data (id, fname, lname) values (10, 'jon_00010', 'roberts_00010');

This repeats for 10,000 tuples.

time psql -f si_test_data.sql > /dev/null
real	5m49.527s

As you can see, this is pretty slow and not recommended for inserting large amounts of data. Nearly 6 minutes to load 10,000 tuples is crawling.
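If you must use INSERT statements, batching many rows per statement and per transaction cuts most of the overhead. A sketch (multi-row VALUES is standard PostgreSQL syntax that Greenplum and HAWQ inherit):

```sql
-- One statement, many tuples: far fewer round trips and commits than singletons.
insert into test_data (id, fname, lname) values
(1, 'jon_00001', 'roberts_00001'),
(2, 'jon_00002', 'roberts_00002'),
(3, 'jon_00003', 'roberts_00003');
```

Still, for anything beyond trivial volumes, the methods below are the better fit.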

COPY
If you are familiar with PostgreSQL’s COPY command, then you will feel right at home with this technique. This time, the data is in a file named test_data.txt and it is not wrapped with insert statements.

head test_data.txt
COPY test_data FROM '/home/gpadmin/test_data.txt' WITH DELIMITER '|';
COPY 10000
Time: 128.580 ms

This method is significantly faster but it loads the data through the master. This means it doesn’t scale well as the master will become the bottleneck but it does allow you to load data from a host anywhere on your network so long as it has access to the master.

gpfdist
gpfdist is a web server that serves posix files for the segments to fetch. Segment processes get the data directly from gpfdist, bypassing the master. This enables you to scale by adding more gpfdist processes and/or more segments.

gpfdist -p 8888 &
[1] 128836
[gpadmin@hdb ~]$ Serving HTTP on port 8888, directory /home/gpadmin

Now you’ll need to create a new external table to read the data from gpfdist.

CREATE EXTERNAL TABLE gpfdist_test_data
(id int,
 fname text,
 lname text)
LOCATION ('gpfdist://hdb:8888/test_data.txt')
FORMAT 'TEXT' (DELIMITER '|');

And to load the data.

INSERT INTO test_data SELECT * FROM gpfdist_test_data;
INSERT 0 10000
Time: 98.362 ms

gpfdist is blazing fast and scales easily. You can add more than one gpfdist location in the external table, use wild cards, use different formats, and much more. The downside is the file must be on a host that all segments can reach. You also have to create a separate gpfdist process on that host.

gpload
gpload is a utility that automates the loading process by using gpfdist. Review the documentation for more on this utility. Technically, it is the same as using gpfdist and external tables; it just automates the commands for you.
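A minimal gpload control file for the test table might look like this (a sketch following the documented gpload YAML format; the hostname and file path are assumptions to adapt):

```yaml
VERSION: 1.0.0.1
DATABASE: gpadmin
USER: gpadmin
HOST: hdb
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
          - /home/gpadmin/test_data.txt
    - FORMAT: text
    - DELIMITER: '|'
   OUTPUT:
    - TABLE: test_data
    - MODE: insert
```

You would then run it with something like `gpload -f test_data.yml`; behind the scenes it starts gpfdist, creates the external table, and runs the INSERT for you.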

Pivotal Extension Framework (PXF)
PXF allows you to read and write data in HDFS using external tables. Like gpfdist, the work is done by each segment, so it scales and executes in parallel.

For this example, I’ve loaded the test data into HDFS.

hdfs dfs -cat /test_data/* | head

The external table definition.

CREATE EXTERNAL TABLE et_test_data
(id int,
 fname text,
 lname text)
LOCATION ('pxf://hdb:51200/test_data?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER '|');

And now to load it.

INSERT INTO test_data SELECT * FROM et_test_data;
INSERT 0 10000
Time: 227.599 ms

PXF is probably the best way to load data when using the “Data Lake” design. You load your raw data into HDFS and then consume it with a variety of tools in the Hadoop ecosystem. PXF can also read and write other formats.

Outsourcer and gplink
Last but not least are software programs I created. Outsourcer automates the table creation and load of data directly to Greenplum or HAWQ using gpfdist. It sources data from SQL Server and Oracle as these are the two most common OLTP databases.

gplink is another tool that can read external data, but this technique can connect to any valid JDBC source. It doesn’t automate as many of the steps as Outsourcer does, but it is a convenient tool to get data from a JDBC source.

You might be thinking that sqoop does this, but not exactly. gplink and Outsourcer load data directly into HAWQ and Greenplum tables. Both are optimized for these databases and fix data for you automatically: they remove null and newline characters and escape the escape and delimiter characters. With sqoop, you have to read the data from HDFS using PXF and then fix the errors that could be in the files.

Both tools are linked above.

This post gives a brief description of the various ways to load data into HAWQ. Pick the right technique for your use case. As you can see, HAWQ is very flexible and can handle a variety of ways to load data.

GPLink Version 0.1.25

GPLink has a new version that fixes a parsing error that occurred when other gpfdist processes were started on your ETL server with a different order of parameters than GPLink uses. As before, if the ports of other gpfdist processes are outside the upper and lower port range defined by GPLink, those gpfdist processes are ignored.

Download Version 0.1.25

Getting the filename with gpfdist

Occasionally, I see the request to get the filename added to a file read by gpfdist. Here is a way to do it!

First, create a YML file named “transform_config.yml” with the following:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transformation_input:
     TYPE: input 
     CONTENT: data
     COMMAND: /bin/bash

Next, create the script file referenced by the COMMAND line. This is just a simple example that gets all txt files, but you can also pass a parameter to this script. The filename in the external table gets passed to the script.

set -e
for i in $(ls *.txt); do awk '{print FILENAME"|"$0}' $i; done

Create two test files (test_file_1.txt and test_file_2.txt).

cat test_file_1.txt 
cat test_file_2.txt 

Start gpfdist in the background.

gpfdist -p 8999 -c transform_config.yml > mylog 2>&1 < mylog &

Create the External Table but be sure to change the hostname. Note that "foo" in the LOCATION is the filename. I'm ignoring it for this example but this is how you can pass parameters to the script. You add %filename% to the YML file as the parameter to the script.

CREATE EXTERNAL TABLE ext_transform_table 
(filename text, id int, description text) 
LOCATION ('gpfdist://gpdbsne:8999/foo#transform=transformation_input') 
FORMAT 'TEXT' (DELIMITER '|');

Now select from the External Table.

select * from ext_transform_table;
    filename     | id | description 
-----------------+----+-------------
 test_file_1.txt |  1 | foo1
 test_file_1.txt |  2 | foo2
 test_file_1.txt |  3 | foo3
 test_file_1.txt |  4 | foo4
 test_file_1.txt |  5 | foo5
 test_file_1.txt |  6 | foo6
 test_file_1.txt |  7 | foo7
 test_file_1.txt |  8 | foo8
 test_file_1.txt |  9 | foo9
 test_file_1.txt | 10 | foo10
 test_file_2.txt | 11 | foo11
 test_file_2.txt | 12 | foo12
 test_file_2.txt | 13 | foo13
 test_file_2.txt | 14 | foo14
 test_file_2.txt | 15 | foo15
 test_file_2.txt | 16 | foo16
 test_file_2.txt | 17 | foo17
 test_file_2.txt | 18 | foo18
 test_file_2.txt | 19 | foo19
 test_file_2.txt | 20 | foo20
(20 rows)

HAWQ 2.0 Generally Available and Outsourcer 5.1.4

HAWQ, or commercially known as Pivotal HDB, just had a major release that I’m really excited about.

Major Features
– Based on Apache HAWQ and also includes support for Quicklz table compression plus support for PL/R, PL/Java, and pgCrypto
– Elastic runtime which means more segments (resources) can be allocated automatically based on the complexity of the query
– YARN integration
– Dynamic sizing of the cluster
– Block level storage which enables maximum parallelism
– Single HDFS directory per table which makes it easier to share and manage data
– Fault tolerance enhancements make it easier and quicker to add or remove data nodes
– HDFS catalog caching
– HCatalog integration which greatly simplifies accessing Hive data
– New management interface with “hawq” commands
– Support for Ambari 2.2.2
– Plugin support for Kerberos
– Better logging for runaway query termination

Product Page

Outsourcer 5.1.4
I also have updated Outsourcer to take advantage of HAWQ 2.0/Pivotal HDB 2.0. In HAWQ 2.0/Pivotal HDB 2.0, tables should be distributed randomly in order to take advantage of many of the new features. Starting with version 5.1.4, Outsourcer will now make all tables distributed randomly when the database is HAWQ 2.0/Pivotal HDB 2.0. For Greenplum and HAWQ 1.3, the tables will still be distributed by the source’s primary key if one is found.

Download 5.1.4
Source Code

Pivotal HDB 2.0 (Apache HAWQ) Table Distribution

Pivotal HDB version 2.0 is very close to being generally available, and there are differences in how table distribution works between this version and 1.3 that are worth mentioning.

HDB is a fork of Greenplum Database which is an MPP database. Greenplum distributes or shards the data across multiple “segments” which are located on multiple “segment hosts”. The distribution is typically set by a hash of a column or set of columns.


CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text)
DISTRIBUTED BY (customer_id);

In Greenplum, this creates a file in each segment, on each segment host. If you make the table column oriented, the number of files increases with a file per column, per segment. Add in partitioning, which again uses separate files for each partition, and you can end up with thousands of files for a single table. This is great for an MPP database with a robust optimizer that can skip scanning files it doesn’t need in order to execute the query in the fastest way possible.

HDB 1.3 uses the same design pattern as Greenplum but it stores the files in HDFS. Hadoop loves big files but doesn’t work optimally with lots of files which meant that you didn’t typically use column orientation and partitions were larger.

HDB 2.0 Distribution
1. There is now a dynamic number of segment processes per host. There is just a single segment directory per data node, and the database dynamically creates the number of buckets as needed.

2. When you create a table with the distribution set (as shown above), the number of buckets is fixed. This is set with the GUC default_hash_table_bucket_number which sets the number of buckets per host.

3. When you create a random distribution table, the number of buckets is dynamic.
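To see this yourself, you can check the GUC and, per the HAWQ 2.0 DDL, pin a hash table to an explicit bucket count with the bucketnum storage option (treat the exact option name and syntax as something to verify against your version):

```sql
-- Buckets per host used for hash-distributed tables:
SHOW default_hash_table_bucket_number;

-- Fix the bucket count for one table at creation time (illustrative table name):
CREATE TABLE customer_hash
(customer_id int,
 customer_name text)
WITH (bucketnum = 6)
DISTRIBUTED BY (customer_id);
```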

So how does this work? Take our example “customer” table above with the distribution set to (customer_id).

INSERT INTO customer SELECT i, 'company_' || i, i || ' main st' 
FROM generate_series(1,1000) AS i;

Query returned successfully: 1000 rows affected, 784 msec execution time.

Now let’s go look at the files.

--use the OID values to find the location in HDFS
SELECT oid FROM pg_database WHERE datname = 'gpadmin';


SELECT c.oid
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';


So our data files are in HDFS. This is just a single-node VM I’m working with, so there is one segment. It has default_hash_table_bucket_number set to 6, so HDB will create 6 buckets of data in HDFS.

hdfs dfs -ls /hawq_default/16385/16508/24591
Found 6 items
-rw-------   1 gpadmin hdfs       6776 2016-05-04 16:51 /hawq_default/16385/16508/24591/1
-rw-------   1 gpadmin hdfs       6768 2016-05-04 16:51 /hawq_default/16385/16508/24591/2
-rw-------   1 gpadmin hdfs       6688 2016-05-04 16:51 /hawq_default/16385/16508/24591/3
-rw-------   1 gpadmin hdfs       6728 2016-05-04 16:51 /hawq_default/16385/16508/24591/4
-rw-------   1 gpadmin hdfs       7600 2016-05-04 16:51 /hawq_default/16385/16508/24591/5
-rw-------   1 gpadmin hdfs       7488 2016-05-04 16:51 /hawq_default/16385/16508/24591/6

Now recreate this table with random distribution, insert the data, and look at the files.

CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text)
DISTRIBUTED RANDOMLY;

INSERT INTO customer SELECT i, 'company_' || i, i || ' main st' 
FROM generate_series(1,1000) AS i;

SELECT c.oid
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';


[gpadmin@hdp23 ~]$ hdfs dfs -ls /hawq_default/16385/16508/24596
Found 1 items
-rw-------   1 gpadmin hdfs      41968 2016-05-04 17:02 /hawq_default/16385/16508/24596/1

It only created a single file in HDFS with random distribution.
– This is great for HDFS because there are fewer files for the namenode to track.
– It allows for elasticity of the cluster: grow or shrink the cluster without having to redistribute the data.
– The optimizer has also been enhanced to dynamically set the number of buckets based on the demand of the query.

As you might be concluding right about now, RANDOM DISTRIBUTION is the recommendation for tables in HDB 2.0. You can still set your distribution to a hash of a column or columns which will use a static number of buckets but random is recommended.

If you create a table now in HDB 2.0 without setting the distribution, the default will be RANDOM. First, create the table with an explicit distribution key and check the distribution policy in the catalog:


CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text)
 DISTRIBUTED BY (customer_id);

SELECT sub.attname
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
JOIN (SELECT p.attrelid, p.attname
FROM pg_attribute p
JOIN (SELECT localoid, unnest(attrnums) AS attnum FROM gp_distribution_policy) AS g ON g.localoid = p.attrelid AND g.attnum = p.attnum) AS sub
ON c.oid = sub.attrelid 
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';


Now, recreate the table without setting the distribution key.

CREATE TABLE customer 
(customer_id int,
 customer_name text,
 customer_address text);

SELECT sub.attname
FROM pg_namespace n
JOIN pg_class c ON n.oid = c.relnamespace
JOIN (SELECT p.attrelid, p.attname
FROM pg_attribute p
JOIN (SELECT localoid, unnest(attrnums) AS attnum FROM gp_distribution_policy) AS g ON g.localoid = p.attrelid AND g.attnum = p.attnum) AS sub
ON c.oid = sub.attrelid 
WHERE n.nspname = 'public'
AND c.relname = 'customer'
AND c.relkind = 'r';

--no data

Outsourcer 5.1.1

5.1.1 enhances Append jobs to use Big Integer in addition to Integer data types. Additionally, you can now use Timestamp data types.

Be sure to always use an ordered sequence in Oracle and an ordered identity in SQL Server when using an Append job. Timestamp is useful when you are using the system timestamp in Oracle or SQL Server to append new data.
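For reference, an ordered sequence in Oracle and an ordered identity in SQL Server look roughly like this (a hedged sketch; object names are illustrative):

```sql
-- Oracle: ORDER guarantees values are issued in request order (matters on RAC);
-- NOCACHE avoids out-of-order values from discarded cache entries.
CREATE SEQUENCE orders_seq START WITH 1 INCREMENT BY 1 ORDER NOCACHE;

-- SQL Server: an IDENTITY column increases monotonically as rows are inserted.
CREATE TABLE orders (order_id bigint IDENTITY(1,1) PRIMARY KEY, order_date datetime);
```

The point of the ordering guarantee is that an Append job can safely resume from the highest value it has already loaded.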

Source Code

HAWQ Demystified

The best SQL engine in Hadoop did not come about overnight as a “me too” product. It seems that not too long ago the Hadoop vendors all but wrote off SQL, but the demand just hasn’t gone away.

HAWQ is the result of many, many years of work leveraging open source software as well as contributing back to the open source community. I think a short history lesson is in order to fully understand how this product came to be.

Greenplum Database began life in 2003. The founders used open source PostgreSQL and released a commercial product soon after.

Bizgres, an open source version of Greenplum, was released in 2005. Very early on, the founders of Greenplum embraced contributing back to the open source community.

Madlib was released in 2010 as an open source project which later became an Apache Incubator project.

Greenplum was acquired by EMC in 2010 and almost immediately, EMC invested heavily into Hadoop. The Greenplum division was agile like a small startup company but with the deep pockets of a multi-billion dollar company.

Greenplum released a Hadoop distribution in 2011 and integration between Greenplum Database and HDFS got more robust with the introduction of “gphdfs”. Greenplum supported External Tables to read/write data in parallel to HDFS from several different distributions.

HAWQ, a fork of Greenplum Database, was released in 2013. HAWQ was immediately extremely performant and compliant with the newest SQL syntax. HAWQ borrowed from the 10 years experience of developing Greenplum to provide a robust optimizer designed for HDFS.

2013 also saw Pivotal become a company. EMC contributed Greenplum Database, VMware contributed Gemfire and Cloud Foundry, and GE contributed capital as an active partner. Paul Maritz became the CEO, and the dedication to fully embrace open source became an integral part of the corporate culture.

During the last three years, HAWQ has become an Apache Incubator project. The Pivotal product now carries the rather boring name “Pivotal HDB”, while HAWQ is the name of the Apache project.

Pivotal also made Greenplum and Geode (Gemfire is the commercial product name) open source projects too. Clearly, Pivotal has embraced open source with probably more committers than those other “open source” data companies.

So what now? What is happening in 2016? Well, Pivotal is about to release Pivotal HDB (HAWQ) 2.0. I’ve been testing this product for months on various platforms and I keep getting amazed by the performance and ease of use.

HAWQ 2.0 embraces Hadoop fully. I believe the two biggest features are elasticity and performance. HAWQ now supports elasticity for growing or shrinking clusters without having to redistribute the data. The performance is also much improved as it better utilizes HDFS and YARN.

Pivotal HDB is certified to run on Hortonworks HDP with plans on becoming a first class citizen of the Open Data Platform (ODPi).

So you may be asking, “is it fast?” and the answer is yes! I haven’t found a SQL engine that is faster and I’ve been doing competitive analysis for months. The other question you may ask is, “can I run my SQL?” and the answer is yes! A major competitor in the SQL on Hadoop landscape requires more tweaking of SQL just to get the SQL to execute and more tuning to get decent performance.

That “other SQL on Hadoop” product can’t do the following things, as well as many more, that HAWQ can.

– Can’t handle a comment line at the end of a SQL file (I found that rather strange)
– Won’t do partition elimination through a joined table (e.g. a date dimension)
– Can’t get the number of milliseconds when subtracting from a date; only seconds
– Has “interval” in its SQL dialect but doesn’t have interval as a type
– No time data type
– Concatenating strings doesn’t use || or +; you must use the concat() function
– Doesn’t support intersect or except
– Doesn’t support subqueries that return more than one row
– Doesn’t support correlated subqueries
– Doesn’t support group by rollup
– Doesn’t support subqueries in the having clause
– Doesn’t support subqueries in the select list

HAWQ is the real deal. It is an Apache project, going to be part of ODPi, faster than everyone else, integrated with Apache Ambari, certified with Hortonworks, and the most mature SQL engine for Hadoop.

Analyzedb Performance Metrics

I have already written about analyzedb in this post, but I thought I would write another post about it with performance metrics.

The old method that I used to analyze tables was to analyze every table and partition sequentially. Then I would analyze the root partition of partitioned tables. The script would look like this:

Tables and Partitions

psql -t -A -c "SELECT 'ANALYZE ' || n.nspname || '.' || c.relname || ';' FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid WHERE n.nspname = 'tpcds' AND c.relname NOT IN (SELECT DISTINCT tablename FROM pg_partitions p WHERE schemaname = 'tpcds') ORDER BY 1" | psql -e

Root Partitions of Partitioned Tables

psql -t -A -c "SELECT 'ANALYZE ROOTPARTITION ' || n.nspname || '.' || c.relname || ';' FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid WHERE n.nspname = 'tpcds' AND c.relname IN (SELECT DISTINCT tablename FROM pg_partitions p WHERE schemaname = 'tpcds') ORDER BY 1" | psql -e

The second method is with analyzedb which can be done with a single line.

analyzedb -d gpadmin -s tpcds --full -a

The default for analyzedb is to use 5 threads, which you can adjust to maximize performance. You can also take advantage of how analyzedb keeps track of the tables it has analyzed, so it won’t unnecessarily analyze unchanged tables, which makes the process even faster.

Here are some numbers to put this into perspective. I’m using virtual machines with only 1GB of data, but the percentage improvement is what we want to measure.

HAWQ 2.0 Beta
Sequential: 18 minutes and 37 seconds
Analyzedb: 8 minutes and 3 seconds
Improvement: 131% faster!

Greenplum Database 4.3
Sequential: 11 minutes and 25 seconds
Analyzedb: 6 minutes and 59 seconds
Improvement: 63% faster!

If you aren’t using analyzedb to maintain your HAWQ and/or Greenplum databases, start using it now! You’ll see much better performance in keeping your tables’ statistics up to date.

Outsourcer 5.0.9

5.0.9 adds support for HAWQ 2.0.

I’m looking for feedback on how best to handle table distribution for tables created in HAWQ 2.0. Outsourcer automatically sets distribution keys based on the source primary keys so it always uses hash when there is a PK found. HAWQ 2.0 supports hash and random distribution as before but random distribution allows a cluster to be resized without having to redistribute the data.

– Should I keep the code as-is?
– Should I set a global environment variable to allow you to set all tables to be created random or not?
– Should I update nearly every UI screen as well as the job and queue tables to have a random boolean that is only used for HAWQ 2.0?

Source Code

Greenplum Single Node Installation

Step 1
Download a CentOS 6 VM from

Step 2
Download the latest Greenplum binaries for RedHat Enterprise Linux 6 from
GPDB Download

Step 3
Start the Virtual Machine with VMWare Fusion or something similar.
Memory: 8GB
Cores: 4
Disk: 50GB
You can use less memory and fewer cores, but the more you provide the VM, the better it will perform. You might have to expand the VM disk space when using the VM.

Step 4
Configure the operating system.

hostname gpdbsne

Add the hostname to /etc/sysconfig/network too.

Turn off firewalls.

chkconfig iptables off
service iptables stop
echo 0 > /selinux/enforce
setenforce 0
vi /etc/selinux/config
--set SELINUX=disabled so the change persists across reboots

Edit the /etc/hosts file.

echo " gpdbsne gpdbsne.localdomain" >> /etc/hosts

I also like to get the ip address for this host and add it to my local /etc/hosts file.


Install unzip, ed, ntp and ssh.

yum install ntp
yum install unzip
yum install openssh-clients
yum install ed
chkconfig ntpd on
/etc/init.d/ntpd start

Add the following to the end of your /etc/sysctl.conf file.

kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1025 65535
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.overcommit_memory = 0

Add this to your /etc/security/limits.conf file.

* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

Remove all lines from /etc/security/limits.d/90-nproc.conf

echo "" > /etc/security/limits.d/90-nproc.conf

There are some other configuration changes you make on a real cluster that involve the XFS filesystem, but for an SNE, they can be skipped. This setup is just intended for development and testing purposes.

Restart the VM so these changes take effect.

shutdown -r now

Step 5
Copy the installer to the VM.

scp root@gpdbsne:/root/

Step 6
ssh to the VM and run the installer.

ssh root@gpdbsne
/bin/bash greenplum-db-

--Accept the license agreement
--Accept default installation directory

Step 7
For a multi-node cluster, the next step is to use gpseginstall but this isn’t needed with a single node installation. Instead, you have to manually create the gpadmin account and get the cluster ready for the next step.

useradd gpadmin
passwd gpadmin
chown -R gpadmin:gpadmin /usr/local/greenplum-db-
mkdir -p /data/master
mkdir /data/primary
chown -R gpadmin:gpadmin /data
su - gpadmin
echo "source /usr/local/greenplum-db/" >> .bashrc
echo "export MASTER_DATA_DIRECTORY=/data/master/gpseg-1" >> .bashrc
source .bashrc
echo "gpdbsne" > hostfile
gpssh-exkeys -f hostfile

Step 8
Create an initialization file called gp_init_config so you can initialize the database.

declare -a DATA_DIRECTORY=(/data/primary /data/primary )
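The one line above only defines the segment data directories (two primary segments). gpinitsystem needs a few more parameters; the values below are illustrative assumptions following the documented config format, not the exact file used here:

```
ARRAY_NAME="Greenplum SNE"
SEG_PREFIX=gpseg
PORT_BASE=40000
declare -a DATA_DIRECTORY=(/data/primary /data/primary)
MASTER_HOSTNAME=gpdbsne
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
MACHINE_LIST_FILE=/home/gpadmin/hostfile
```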

Step 9
Initialize the database.

gpinitsystem -c ~/gp_init_config
--select Y to continue

Now create the default database and configure the database to allow external connections.

psql -c "create database gpadmin" template1
psql -c "alter user gpadmin password 'changeme'"
echo "host all all md5" >> /data/master/gpseg-1/pg_hba.conf
gpstop -u

Now you can connect to Greenplum with the gpadmin account and the password is changeme. The default port is 5432. A great client tool is pgAdmin III v1.14.3.

Dear Cloudera, Please Stop!

So I was alerted to this blog post by a colleague, and I was floored by what Cloudera is doing.

Basically, they are creating a new feature for objects that can store multiple levels in a single table. In a traditional relational structure, you may have an Orders table with another table for Order_Details. You would simply JOIN the two tables together when needed.


CREATE TABLE example.Orders
(order_id int NOT NULL,
 order_date date NOT NULL,
 customer_id int NOT NULL);

CREATE TABLE example.Order_Details
(order_detail_id int NOT NULL,
 order_id int NOT NULL,
 product_id int NOT NULL,
 quantity int NOT NULL,
 price numeric NOT NULL);

And when you want to query both tables:

SELECT o.order_date, sum(od.price)
FROM example.orders o 
JOIN example.order_details od ON o.order_id = od.order_id
GROUP BY o.order_date;

What Cloudera is doing is an Object approach which combines both tables into a single table using a feature called a STRUCT. This is similar to a composite type in Greenplum, PostgreSQL, and HAWQ.

CREATE TABLE example.Orders
(order_id int NOT NULL,
 order_date date NOT NULL,
 customer_id int NOT NULL,
 Order_Details ARRAY<STRUCT<
    order_detail_id: int,
    product_id: int,
    quantity: int,
    price: numeric>>);

I get this approach when working with an OLTP system where developers want to match their classes to the database structure, but I really don’t see the benefit of this in a big data platform. It just makes it difficult for database users to understand and use the data.

Every database I’ve ever worked with uses Schema.Table to organize table structures but not Impala! Cloudera has decided to break away from the decades old standard and use the dot notation for their nested fields stored within a STRUCT. If you adopt this silliness, you are adopting a standard that no one else uses. Why would Cloudera want that? Vendor lock-in maybe????

select * from orders.order_details;

Orders isn’t the schema in the above example. This is a table. Cloudera just broke SQL!

Cartesian Product? Nope, Just Vendor Specific Syntax
Notice how this query appears to reference two different tables without specifying a join.

SELECT o.order_date, sum(od.price)
FROM orders o, o.order_details od 
GROUP BY o.order_date;

They even allow you to do an OUTER JOIN to the STRUCT ARRAY without defining the columns to join on.

SELECT o.order_date, sum(od.price)
FROM orders o
LEFT OUTER JOIN o.order_details od 
GROUP BY o.order_date;

Reasons Why Cloudera is Doing This
Here are my guesses as to why they are doing this.
1. They hate SQL or just don’t understand it.
2. Vendor Lock-In. They need ways to make it difficult for customers to switch to a competitor’s product.
3. Impala is bad at joins. Organizing tables like this makes it easier for Impala’s query optimizer to create a robust query plan.

Please Cloudera, just stop!

Updated gplink

I’ve made some enhancements to gplink.

1. Automatically starts gpfdist when creating External Tables
2. New command line programs: gpltable (create table), gpldrop (drop table), and gplstart (start any missing gpfdist processes)
3. Removal of the need for the Greenplum or HAWQ client
4. No additional database objects are needed
5. Drop External Table handled from gplink also stops the gpfdist process it had started
6. Most code moved to Java with some basic shell scripts to automate calling the Java methods

Documentation and Source Code
gplink Github Repository


TPC-DS Benchmark

The TPC is a non-profit organization that provides several benchmarks for databases. The two common benchmarks they provide for Data Warehousing and Analytics are the Decision Support (TPC-DS) and Ad-Hoc (TPC-H) benchmarks. More information can be found on the TPC website.

SQL on Hadoop
– HAWQ can execute all 99 queries without modification.
– IBM Big SQL can execute all 99 queries but requires modifying 12 queries (as of the most recent publication I can find).
– Impala can’t run all of the queries and many require modifications to execute.
– Hive can’t run all of the queries and many require modifications to execute.

There isn’t much point in comparing HAWQ with these other SQL engines when none can execute all of the queries without modification! So… this post will focus on the capabilities of HAWQ as well as the automated TPC-DS benchmark process I have put together.

The tests will use a Hortonworks 2.2 Sandbox with 8GB of RAM and 4 cores dedicated to the VM. The dataset will be 2GB in size which works well in a VM. This is definitely not a huge test but it will demonstrate the capabilities of HAWQ. Again, all 99 queries can run in HAWQ without any modification!

I’m using a set of TPC-DS test scripts I put together to help evaluate different hardware platforms. The script works with both HAWQ and Greenplum database.

This script automates the entire process of generating the data, building the tables, loading the tables, and executing the TPC-DS queries.

1. Compile TPC-DS: 3.4 seconds
2. Generate the Data using dsdgen: 3 minutes 23 seconds
3. Create tables and ensure correct optimizer settings are enabled: 5.5 seconds
4. Load the tables: 2 minutes and 5 seconds
5. Execute all 99 TPC-DS queries: 6 minutes 46 seconds
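Those phases add up to under 13 minutes end to end for a 2GB benchmark; converting each step to seconds and summing (a quick sanity check):

```shell
# Phase durations from the run above, in seconds:
# compile 3.4, dsdgen 203, DDL 5.5, load 125, queries 406.
total=$(awk 'BEGIN { print 3.4 + 203 + 5.5 + 125 + 406 }')
echo "total: $total seconds"   # 742.9 seconds, about 12 minutes 23 seconds
```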


To put this into perspective, I attempted to run a 2GB TPC-DS benchmark in the same VM with another SQL on Hadoop tool and it took 2 hours just to get all of the data loaded! The engine was unable to execute the vendor-provided SQL queries either, so I gave up on that effort. HAWQ is the way to go!

Greenplum Database


There are lots of new features in this maintenance release of Greenplum database.

1. External Tables can now read Parquet and Avro files in HDFS.
2. Partitioned Tables can now have External Partitions. This would be useful to have a Greenplum table with some data residing external to the database in something like Hadoop.
3. More enhancements to the Query Optimizer.
4. Native connection pooling.
5. COPY commands can be controlled by Resource Queues.
6. Utility analyzedb can now analyze catalog tables.
7. Utility gptransfer has enhancements.
8. Improvements in resynchronization of segment mirrors.
9. Utilities gpmigrator and gpmigrator_mirror have enhancements.
10. Improvements in gpsupport utility.

Apache HAWQ

It is official! The Apache Software Foundation has accepted HAWQ into the Apache Incubator!

Here are the results from voting.

21 binding +1 votes
5 non-binding +1 votes
no 0 or -1 votes.

The Apache HAWQ site

Are you wondering what this is about or why this matters? Pivotal announced earlier this year that all products will be made open source. Pivotal already has many open source projects like Spring, Cloud Foundry, Gemfire, Redis, Open Data Platform, and Madlib but this will be expanded to include Greenplum Database and HAWQ.

HAWQ is already the fastest and most SQL-compliant SQL on Hadoop offering in the market, but now it will also be open source! This is a great thing for the Hadoop community as HAWQ is a very powerful tool for the Hadoop ecosystem.

ANALYZE and analyzedb

Note: This post is for Greenplum only and I will make a subsequent post for HAWQ.

What Is Analyze?
Analyze is a pretty simple command that gathers statistics on tables so the cost-based optimizer can make the best plan possible. Greenplum has a mechanism to do this automatically for you but in some circumstances it won’t, so you will have to run it manually.

on_no_stats is the default, which simply means the database will automatically gather stats when data is inserted into a table for the first time. When you do a Create Table As Select (CTAS), it will gather the statistics for you. If you create a table and then INSERT data, it will gather stats for you. But, if you INSERT data to a table that already has stats, Greenplum will not gather stats for you again.


CREATE TABLE foo (id int, fname text) DISTRIBUTED BY (id);
INSERT INTO foo VALUES (1, 'jon');
INSERT INTO foo select i, 'jon_' || i from generate_series(1,1000000) as i;

In the above example, the first INSERT statement will cause the database to gather the stats and it will record that foo only has 1 row in it. The next INSERT of 1 million records won’t trigger another ANALYZE. Do you see where you could get into trouble? You need to run ANALYZE on foo.


on_change sounds like a good idea: if you insert more data into a table, it will automatically gather stats for you. But, there is another GUC that is important here. gp_autostats_on_change_threshold determines how many rows need to be added to a table before an ANALYZE is run automatically again. The default value is 2,147,483,647!

So in our first example of adding 1 million rows to an existing table, ANALYZE will not run automatically even with gp_autostats_mode=on_change because 1 million is far below the default threshold.

If you want to use on_change, I suggest reducing the threshold to a lower number.

none is the last value you can set for this GUC and it does exactly what it says: nothing.
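Pulling the three modes together, here is a small shell sketch (an illustration only, not Greenplum code) of when an automatic ANALYZE fires:

```shell
# Illustrative decision logic for automatic ANALYZE, based on
# gp_autostats_mode and gp_autostats_on_change_threshold.
should_autostats() {
  mode=$1; has_stats=$2; rows_changed=$3; threshold=$4
  case "$mode" in
    on_no_stats) [ "$has_stats" = "no" ] && echo yes || echo no ;;
    on_change)   [ "$rows_changed" -gt "$threshold" ] && echo yes || echo no ;;
    none)        echo no ;;
  esac
}

should_autostats on_no_stats no  1000000 2147483647   # yes: first load into a table with no stats
should_autostats on_no_stats yes 1000000 2147483647   # no: stats already exist
should_autostats on_change   yes 1000000 2147483647   # no: below the default threshold
```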

Partitioned Tables
If you insert data into an empty partitioned table, an ANALYZE will never automatically be executed! You will have to execute ANALYZE manually.

This is really important. Inserting data into partitioned tables will not cause an ANALYZE to be executed. The stats will be incorrect and you will need to run ANALYZE on the table.

Root Partition
The new Query Optimizer, code named Orca, must have statistics on the “rootpartition”. This is simply a rollup of all of the statistics from each partition at the table level.

When enabling Orca, you will need to add this setting:

gpconfig -c optimizer_analyze_root_partition -v on --masteronly

This will tell Greenplum to gather the table level statistics on a table when you ANALYZE it. If you don’t do this, the table will appear to be empty to Orca and it will most certainly have a poor query plan.

You can also run ANALYZE on just the rootpartition:

ANALYZE ROOTPARTITION my_table;


Please Make this Easy! analyzedb
We did! There is a new utility called “analyzedb” which tracks what tables and partitions need to be analyzed and does it for you automatically. It also does the work in parallel so it completes the task much, much faster.

You can specify an entire database:

analyzedb -d gpdb

single schema:

analyzedb -d gpdb -s myschema

or a single table.

analyzedb -d gpdb -t

analyzedb is the program to add to your maintenance scripts to ensure you have the correct statistics in the database. It is another example of the continual development and enhancements of Greenplum database.

Enabling The New Greenplum Query Optimizer

Starting with this release of Greenplum Database, there is a new Query Optimizer available that is faster than the legacy optimizer.

Greenplum Database is a fork of PostgreSQL 8.2. This fork was done to optimize SQL queries for a distributed cluster and it has done very well over the years. Oftentimes Greenplum is faster than competing MPP databases because of this Query Optimizer. But, because it has many years of development on top of PostgreSQL, it has become difficult to manage and enhance.

Orca is the code name of the new Query Optimizer, a refactor of the legacy Query Optimizer that also includes enhancements for performance and maintainability. This post is about how to enable the new Query Optimizer.

Note: You must be using a version of Greenplum Database that includes the new Query Optimizer.

1. Log into the Master server with ssh as gpadmin.
2. Make sure the database is running and healthy.

gpstate

3. Execute gpconfig:

gpconfig -c optimizer_analyze_root_partition -v on --masteronly

4. Enable the new optimizer for the entire system.

gpconfig -c optimizer -v on --masteronly

5. Apply the changes with gpstop.

gpstop -u

6. Disable the new Query Optimizer for the gpperfmon database.

ALTER DATABASE gpperfmon SET OPTIMIZER = OFF;

7. Verify the new Query Optimizer is on.

[gpadmin@mdw ~]$ psql
psql (8.2.15)
Type "help" for help.

gpdb=# explain select version();
                   QUERY PLAN                   
 Result  (cost=0.00..0.00 rows=1 width=8)
   ->  Result  (cost=0.00..0.00 rows=1 width=1)
 Settings:  optimizer=on
(3 rows)

8. Using psql, change to the gpperfmon database and make sure the new Query Optimizer is off.

gpdb=# \c gpperfmon
You are now connected to database "gpperfmon" as user "gpadmin".
gpperfmon=# explain select version();
                QUERY PLAN                
 Result  (cost=0.00..0.01 rows=1 width=0)
 Settings:  optimizer=off
(2 rows)


HAWQ versus Hive on HDP (Part 2)

Part 1 covered the comparison of HAWQ versus Hive using Map-Reduce instead of Tez. Now in Part 2, I will execute the same scripts but with Tez enabled.

Enabling Tez is pretty easy to do with HDP. You simply change this in the General section under Hive and restart Hive:

hive.execution.engine = tez

Next, I re-ran the same loads and queries. Overall, it is roughly 25% faster with Tez enabled but it is still much slower than HAWQ.

Here is the graph of results:
HAWQ vs Hive 2

– Loading was slower with Tez enabled but this is probably because I’m testing with a VM.
– Every query was about 40% faster in Hive with Tez versus without.
– HAWQ was about 30 times faster than Hive with Tez and about 57 times faster than Hive with Map-Reduce.
– The execution time reported by Hive was incorrect for each query. For example, I used “time hive -f icd9.sql” to execute the icd9.sql query and capture the timings. Time reported:

real    0m25.060s

but Hive reported:

Time taken: 13.17 seconds, Fetched: 19 row(s)

So another test but the same result. Hive is much, much slower than HAWQ and even with Tez enabled. If you want Enterprise level SQL for Hadoop, HAWQ is the best solution.

HAWQ versus Hive on HDP (Part 1)

I recently built a Virtual Machine running CentOS 6.4, a Hortonworks installation of Ambari 1.7, Hortonworks HDP, and Pivotal HAWQ. If you aren’t already familiar with the Open Data Platform, it is the shared industry effort to standardize the core components of Hadoop (HDFS, YARN, MapReduce, and Ambari) that both Hortonworks and Pivotal are a part of. This means I have HAWQ running on the Hortonworks deployment and this is a supported configuration.

After I got this built, I decided to put a set of demo scripts I use for HAWQ in the VM. It contains a simple Star Schema of Center for Medicare and Medicaid Services (CMS) data. It consists of 5 Dimensions and a single claims Fact with 1 million claims. CMS Star Schema

The demo does the following:
1. Loads the claims Fact with 1 million claims
2. Creates and loads the 5 Dimensions
3. Creates and loads a single “Sandbox” table that joins all Dimensions to build a 1NF table
4. Executes 3 basic ad-hoc queries

I then converted my scripts to do the same with Hive but I ran into one problem. CMS provides one table with a range of values rather than providing each distinct value. For example:

001|139|Infectious and parasitic diseases
240|279|Endocrine, nutritional and metabolic diseases, and immunity disorders

With HAWQ, I used generate_series(int, int) in a simple SQL query to generate every value in the range. Unfortunately, Hive doesn’t support this function so I had two options: write my own UDF or borrow the transformed data from HAWQ. I chose the latter because I could get this done faster.
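For illustration, the same range expansion can be sketched outside the database with awk (a stand-in for generate_series, using the row layout shown above):

```shell
# Expand "low|high|description" rows into one row per code in the range,
# mimicking what generate_series(low, high) did in the HAWQ SQL.
printf '%s\n' '001|139|Infectious and parasitic diseases' |
awk -F'|' '{ for (i = $1 + 0; i <= $2 + 0; i++) printf "%03d|%s\n", i, $3 }' |
head -3
# prints 001, 002, and 003, each with the description
```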

And now, the results!

HAWQ Timings
Load CMS Table: 3.1 seconds
Dimensions: 1.6 seconds
Sandbox Table: 3.1 seconds
Gender Query: 0.9 second
Location Query: 0.72 second
ICD9 Query: 0.95 second
Total: 10.37 seconds

Hive Timings
Load CMS Table: 44.5 seconds
Dimensions (4 of the 5): 3 minutes 42.6 seconds
Dimension Fix: 35.7 seconds
Sandbox Table: 45.6 seconds
Gender Query: 47.5 seconds
Location Query: 48.4 seconds
ICD9 Query: 48.8 seconds
Total: 8 minutes 13.2 seconds
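Summing the two sets of timings shows the overall gap is roughly 47x:

```shell
# Sum each set of timings (in seconds) and compute the overall ratio.
hive=$(awk 'BEGIN { print 44.5 + 222.6 + 35.7 + 45.6 + 47.5 + 48.4 + 48.8 }')
hawq=$(awk 'BEGIN { print 3.1 + 1.6 + 3.1 + 0.9 + 0.72 + 0.95 }')
awk -v h="$hive" -v q="$hawq" 'BEGIN { printf "Hive: %ss, HAWQ: %ss, ratio: %.1fx\n", h, q, h / q }'
# Hive: 493.1s, HAWQ: 10.37s, ratio: 47.6x
```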

This graph is very telling.
HAWQ vs Hive

Hive looks to be a good entry level tool for SQL on Hadoop. The performance isn’t great but it comes with every distribution of Hadoop and is easier to use than other Hadoop tools. But HAWQ will give you significantly better performance and this can be on either the Pivotal HD or the Hortonworks distributions. Both distributions are based on the ODP which delivers the standard Hadoop core so that HAWQ will work on either distribution.

In my next post, I plan on enabling TEZ for Hive to see what the performance benefit will be. It should help but I don’t think it will come close to the performance of HAWQ. We shall see.

XML Loading Again

I covered this topic here before with a solution that handles very large XML files in Hadoop. This blog post covers parsing XML as it is loaded in Hawq or Greenplum database.


The XML being parsed is the same catalog of large-product and small-product entries used in my earlier XML Parsing post. Here is the XSL transform (sample.xsl):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="catalog"><xsl:for-each select="large-product">
Large|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each><xsl:for-each select="small-product">
Small|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each></xsl:template>
</xsl:stylesheet>

Next, create the gpfdist transform configuration file (sample.yml):

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  sample:
     TYPE: input
     CONTENT: data
     COMMAND: /usr/bin/xsltproc sample.xsl %filename%

Start gpfdist:

gpfdist -c sample.yml -p 8080 >> sample.log 2>&1 < sample.log &

Create External Table

create external table sample
(product_type text, product_name text, product_price int)
location ('gpfdist://bigmac:8080/sample.xml#transform=sample')
format 'text' (delimiter '|' header);

Select the data.

gpdb=# select * from sample; 
product_type | product_name | product_price 
 Large        | foo1         |           110
 Large        | foo2         |           120
 Large        | foo3         |           130
 Large        | foo4         |           140
 Large        | foo5         |           150
 Small        | bar1         |            10
 Small        | bar2         |            20
 Small        | bar3         |            30
 Small        | bar4         |            40
 Small        | bar5         |            50
(10 rows)

This solution works great for parsing reasonably sized XML files into a relational format but if you have very large files, use Hadoop and review my other blog post here.

New Project: gplink

I created a new project that simplifies the process to create Greenplum or Hawq External Tables using gpfdist to stream data from any valid JDBC source. It is like pointing gpfdist at Sqoop to pull data without landing a file but gplink ensures that the data is cleansed first so that the data will be readable by Greenplum or Hawq.

This will work with PostgreSQL (yeah!), MySQL (yuck), DB2, Informix, etc. You will have to download all third party JDBC drivers separately.

gplink v0.0.2
Source Code

This is a new project so I’m looking forward to any feedback you can provide.

Outsourcer 5.0.5

I have made improvements around the starting and stopping of gpfdist processes. It now handles the situation where a gpfdist process doesn’t start properly and instead of failing, it will pick the next available port. I also completely separated osstart and uistart. The User Interface does not need to be running in order for Outsourcer to load tables in the Queue or select from Custom SQL Tables.

This version is a release candidate as testing has shown it is very stable.

Source Code

Outsourcer 5.0.4

Here is another new release of Outsourcer 5. I will likely have another release by the end of the week to further clean up the code so be on the lookout for that. I will update the documentation for release 5.0.5 as well.

I am making changes to Outsourcer 5 due to the move from External Web Table to External Tables using gpfdist. In my testing with 4 concurrent jobs, gpfdist seemed to work fine with only one process handling the load. After stress testing, I have learned that gpfdist is not multi-threaded and a single process can not handle the load of multiple jobs.

I have resolved this issue in 5.0.3 by starting and stopping gpfdist for every job. Upon further stress testing, I identified a potential problem when gpfdist fails to start properly. This would cause multiple jobs to fail in a cascading fashion.

5.0.4 resolves the scenario when gpfdist fails to start. The single job will fail and not affect other jobs.

This release also decouples the User Interface from Outsourcer processing jobs in the Queue. You can run Outsourcer without the User Interface running at all.

I also identified a scenario where installing Outsourcer 5 on the Master was causing problems. The installer was expecting the installation to happen on the Standby-Master or on a dedicated ETL server, so the installer would create a new .pgpass file and back up an existing one. This would remove the gpmon entry and prevent Command Center from working correctly. 5.0.4 now keeps the gpmon entry, if found, in the .pgpass file. If you have a Standby-Master or an ETL host, install Outsourcer 5 there instead.

And lastly, if you just want a stable release of Outsourcer that isn’t changing frequently, use Outsourcer 4.1.6 which hasn’t changed since January and uses External Web Tables.

Source Code

External Table “TRANSFORM” Option

In the Greenplum Administrator Guide, there is a section that covers loading XML data with gpload. It also mentions that you can create the External Table yourself. This demo will show you how to create an External Table that utilizes this feature so that you can execute any script you want on your ETL server. This demo also works for HAWQ.

XML Parsing is done sequentially from the top to the bottom of a file and the TRANSFORM option was built with this in mind. The gpfdist process will execute a script for you and the output is read by the segments in parallel. Instead of parsing XML with a script, this demo will execute a Unix command in a script to show you how you can leverage this feature to execute virtually any command you want on a remote server.

ETL Server

Create a YAML configuration file for gpfdist that defines the transform:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transform_demo5:
     TYPE: input
     CONTENT: data
     COMMAND: /bin/bash

As you can see, I created a transform named “transform_demo5” that executes a bash script. Let’s look at what that script does:

df -k | awk '{print $1"|"$2"|"$3"|"$4"|"$5"|"$6}' | tail -n +2

This simple command executes the Unix df command and converts the output to pipe-delimited rows, one per filesystem, with the header line removed.

Now start a gpfdist process that uses the configuration file. The “-c” option isn’t well documented for gpfdist but it is mentioned in the XML loading documentation.

gpfdist -p 8999 -c /data1/demo5/demo5.yml 2>&1 > demo5.log &

Now create an External Table. Notice the format of using #transform=transform_demo5. The filename of “foo” is ignored but you can reference the filename as a parameter to your transform scripts.


CREATE EXTERNAL TABLE get_df
(Filesystem text,
 K_blocks int,
 Used int,
 Available int,
 Used_percentage text,
 Mounted_on text)
LOCATION ('gpfdist://pivhdsne:8999/foo#transform=transform_demo5')
FORMAT 'TEXT' (DELIMITER '|');

Now create the table in your database server (Hawq or Greenplum).

psql -f get_df.sql -h gpdbvm43 
Timing is on.
Time: 238.788 ms

Now let’s see the output.

[pivhdsne:demo5]$ psql -h gpdbvm43
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from get_df;
 filesystem | k_blocks |   used   | available | used_percentage | mounted_on 
 /dev/sda3  | 47472560 | 12104924 |  32956168 | 27%             | /
 tmpfs      |  4030752 |        0 |   4030752 | 0%              | /dev/shm
 /dev/sda1  |    99150 |    48764 |     45266 | 52%             | /boot
(3 rows)

Time: 32.149 ms

This is yet another feature of Greenplum and Hawq that gives you more flexibility in working with data. I’ve seen this feature used to move files, change permissions, get data from running programs, and of course parsing XML. Enjoy!

Outsourcer 5.0

Outsourcer 5.0 is now available. This version addresses most of the requests that I have received since the release of 4.0. I’m very excited to complete this work!

Outsourcer now uses External Tables with gpfdist rather than External Web Tables. This is a big change and allows you to install Outsourcer on any host that is accessible to the Greenplum Segment Hosts / HAWQ Data Nodes. It also reduces the load on the Master server.

Installation does more checking and automation.

Root is no longer needed for installation.

start_all and stop_all scripts are provided to ease starting and stopping Outsourcer.

A new set of screens for creating your own Custom External Table with a pre-defined SQL Statement.

You can install multiple copies of Outsourcer to load data to multiple databases.

Source Code

Open Data Platform

Pivotal teams up with Hortonworks for the Open Data Platform. Pivotal is also making the data products (Gemfire, Greenplum Database, and HAWQ) all open source.

Outsourcer 4.1.5 released

Bugs Fixed
1. Error messages raised from Oracle and SQL Server could include special characters which would cause the thread in Outsourcer to die and the status remain labeled as “processing”. Error messages no longer have escape and quote characters to prevent this problem.
2. On new installations, the permissions for the Oracle and SQL Server Jar files would remain owned by root. This has now been corrected to be the admin user (gpadmin).
3. On the “Sources” screen, the label for Append-Only now is correct for Greenplum Database 4.3 and reflects that these are really “Append-Optimized” tables.

Enhancements
1. Added External Tables to enable starting and stopping the UI rather than having to ssh to the Master host.
2. “Queue” screen now provides the ability to cancel processing jobs.
3. Stopping Outsourcer Queue Daemon (Environment Screen), now will cancel currently running Jobs in the Queue.
4. Starting Outsourcer Queue Daemon (Environment Screen), now will make sure orphaned jobs in the Queue will be marked as Failed and if any are currently running, it will cancel these jobs too.


Greenplum Database Connecting to Hortonworks UPDATED

Hortonworks 2.1 is now officially supported by Pivotal in Greenplum Database versions 4.3.3.x and 4.2.8.x by using External Tables.

Greenplum Database 4.3.3 Documentation
Greenplum Database 4.2.8 Documentation

The documentation has the basic configuration but I wanted to provide more detailed instructions.

For my test, I downloaded the Hortonworks HDP 2.1 Sandbox. Next, I started the instance and I created a user named “gpadmin” just to make it easier for me to connect to the cluster.

Pivotal Greenplum Database
Step 1: Downloaded the Greenplum Database Trial Virtual Machine and started the cluster.

Step 2: As root, add the Hortonworks repo

wget -nv -O /etc/yum.repos.d/hdp.repo

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 3: As root, install the Hadoop client

yum install hadoop* openssl

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 4: As root, remove previous OpenJDK and install Oracle JDK. You will need to download the Oracle JDK from Oracle.

yum remove java
rpm -i jdk-7u67-linux-x64.rpm

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 5: As gpadmin, start the Greenplum database and set the Hadoop target version to Hortonworks

gpconfig -c gp_hadoop_target_version -v hdp2

Step 6: As gpadmin, edit your .bashrc and add the following

export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/hadoop/client

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 7: As gpadmin, restart the database

source .bashrc
gpstop -r -a

Step 8: As root, add host entry to the Master and Standby-Master for the Hadoop cluster

echo " hdp21" >> /etc/hosts

Note: This will need to be done on all nodes in a full Greenplum cluster and all Hortonworks nodes should be in the hosts file too. DNS can also be used.

Build Tables

CREATE WRITABLE EXTERNAL TABLE ext_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 60 ms.
INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;
--Query returned successfully: 100 rows affected, 2636 ms execution time.
CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 28 ms.
SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;

Maintenance Script for HAWQ and Greenplum database

A common question I get is what maintenance activity is needed in HAWQ and Greenplum database. There are lots of ways to handle this and a lot of customers do it as part of their ETL processing. I put together a script that is pretty generic and should be very useful for most implementations.

The script is available on Github in a new project I created for scripts. As I come up with more scripts for various activities, I will put it in this Github repository.

Link to repository
Note: Run this script on a regular basis such as weekly or monthly.


XML Parsing

I recently worked with a customer that receives very large and complex XML files from external partners. This customer wanted the XML files parsed and available for SQL access so they can do reporting and analytics.

There are many ways to handle XML files but in this case in which I had very large files, I needed a cluster of machines and Hadoop is pretty good at that. The processing can be done with Map Reduce or a tool like Pig which simplifies Map Reduce.

Solution 1

  • Load raw file to Hadoop
  • Transform XML to tab delimited file with Pig
  • Create External Table in HAWQ to read file data in Hadoop

Sample XML file.

<?xml version="1.0"?>
<catalog>
   <large-product>
      <name>foo1</name>
      <price>110</price>
   </large-product>
   <!-- ...more large-product entries... -->
   <small-product>
      <name>bar1</name>
      <price>10</price>
   </small-product>
   <!-- ...more small-product entries... -->
</catalog>

As you can see, I have two record sets of large products and small products but I just want the small products in a table.

First, put the raw XML data into Hadoop.

hdfs dfs -mkdir /demo4
hdfs dfs -put catalog.xml /demo4

Here is the Pig script.

REGISTER /usr/lib/gphd/pig/piggybank.jar;
A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product')
AS (doc:chararray);

clean = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<small-product>\\s*<name>(.*)</name>\\s*<price>(.*)</price>\\s*</small-product>'))
AS (name:chararray,price:int);
store clean into '/demo4/alt_small_data';

What Pig is doing for me is to first get only the small-product records. This only requires a single line in the script and is very useful. The next step is to use regular expressions to parse each tag. This is very painful to get right because Pig uses MapReduce to parse the data. This is powerful but relatively slow to iterate on until you get it right. Even with a small file, each iteration took at least 30 seconds to execute and the full file took 22 minutes.

The last step is to create an External Table in HAWQ.

CREATE EXTERNAL TABLE ext_alt_demo4
(name text, price int)
FORMAT 'text' (delimiter E'\t');

And selecting the data in HAWQ.

SELECT * FROM ext_alt_demo4;
 name | price 
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 127.334 ms

This was my first approach for XML parsing until I got frustrated with the many XML tags to create regular expressions for. The XML I had wasn’t as neat as my example so I had to re-run the Pig script over and over again for each slight modification to the parsing logic.

Solution 2
This the same basic process as Solution 1 but instead of parsing each record with regular expressions in Pig, I will create a single column and do the parsing with SQL in HAWQ.

Here is my Pig script.

REGISTER /usr/lib/gphd/pig/piggybank.jar;
A = LOAD '/demo4/catalog.xml' USING org.apache.pig.piggybank.storage.XMLLoader('small-product') AS (doc:chararray);
clean = foreach A generate REPLACE(REPLACE(doc, '\\u000D', ''), '\\u000A', '');
store clean into '/demo4/small_data';

So instead of using regular expressions, I’m replacing carriage return and newline characters from the XML so that each record is in one row. Then I store that back in Hadoop.

Here is the External Table in HAWQ.

CREATE EXTERNAL TABLE ext_demo4
(xml_data text)
FORMAT 'TEXT' (delimiter E'\t');

I then created a simple SQL function to parse the data.

CREATE OR REPLACE FUNCTION fn_extract_xml_value(p_tag text, p_xml text) RETURNS text AS
$$
SELECT SPLIT_PART(SUBSTRING($2 FROM '<' || $1 || '>(.*)</' || $1 || '>'), '<', 1)
$$ LANGUAGE SQL IMMUTABLE;
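To see what the pattern does, the same tag extraction can be tried outside the database with sed (an illustration of the pattern only, not part of the load):

```shell
# Pull the text between <name>...</name> and <price>...</price>,
# the same idea as fn_extract_xml_value in the SQL above.
row='<small-product><name>bar1</name><price>10</price></small-product>'
echo "$row" | sed -n 's:.*<name>\([^<]*\)</name>.*:\1:p'    # bar1
echo "$row" | sed -n 's:.*<price>\([^<]*\)</price>.*:\1:p'  # 10
```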

And my SQL statement that parses the data.

SELECT (fn_extract_xml_value('name', xml_data))::text AS name, (fn_extract_xml_value('price', xml_data))::int AS price FROM ext_demo4;                 
 name | price 
 bar1 |    10
 bar2 |    20
 bar3 |    30
 bar4 |    40
 bar5 |    50
(5 rows)

Time: 94.887 ms

The benefit for me in this second approach is the huge performance increase in the iterative approach of getting the XML parsing correct. Instead of taking several minutes to validate my code in Pig, I could execute a SQL statement that takes less than 1 second to run. It took another quick second to modify the SQL function and then I would try again.

Hadoop is powerful and has become commodity software with many distributions available that are all pretty much the same. The difference in distributions is the software that is unique to each vendor. Some vendors rely on their management tools while Pivotal HD has HAWQ which is the most robust SQL engine for Hadoop. This example shows how you can leverage the built-in functionality of Hadoop plus HAWQ to be more productive compared to using any other Hadoop distribution.

When to ANALYZE in Greenplum and HAWQ?

Table statistics give the cost-based optimizer the information needed to build the best query plan possible and this information is gathered with the ANALYZE command. But when should you execute an ANALYZE on a table? Don’t Greenplum and HAWQ do this automatically?

Greenplum and HAWQ will perform an ANALYZE automatically for you so the query optimizer will have good statistics and build a good plan. Greenplum and HAWQ also allow you to configure this if needed with gp_autostats_mode and gp_autostats_on_change_threshold.

gp_autostats_mode specifies when an automatic ANALYZE should take place. The values for this configuration parameter can be none, on_change, or on_no_stats.

The default is on_no_stats and the database will automatically perform an ANALYZE for you. If you INSERT data into an empty table (new table or a table you just did a TRUNCATE on), the database will automatically gather statistics with an ANALYZE.

Typically in an Analytics Data Warehouse, you will insert data once and then read it many times. So the default of on_no_stats will automatically give you statistics as needed with no additional work from you.

Self explanatory. You have to execute ANALYZE if you want statistics.

With this setting, if perform an INSERT, UPDATE, or DELETE that exceeds the gp_autostats_on_change_threshold value, then an automatic ANALYZE will happen. Note for HAWQ, you can only INSERT.

In Greenplum and in the rare case where you are doing a DELETE or UPDATE to a table, you will need to execute an ANALYZE statement. Or, you can set the database to use on_change and the ANALYZE will happen automatically.

This is only relevant for on_change and it is the number of rows that need to change before an automatic ANALYZE will happen. The default is over 2 billion rows so if you really want to use on_change, then you will also need to reduce this configuration parameter to something more realistic.
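As a mental model (my illustration, not Greenplum's actual implementation), the decision the database makes before auto-gathering statistics can be sketched like this:

```python
# Toy sketch of the automatic-ANALYZE decision; not Greenplum's real code.
# mode: "none", "on_no_stats" (the default), or "on_change"
def should_auto_analyze(mode, table_has_stats, rows_changed,
                        threshold=2147483647):
    if mode == "on_no_stats":
        # Only a table with no statistics (freshly created or just
        # truncated) triggers an automatic ANALYZE.
        return not table_has_stats
    if mode == "on_change":
        # A DML statement touching more rows than the threshold
        # triggers an automatic ANALYZE.
        return rows_changed > threshold
    return False  # mode == "none": you must run ANALYZE yourself
```

Note how the default threshold of 2147483647 means on_change effectively never fires until you lower gp_autostats_on_change_threshold.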


Example 1 – The default
gp_autostats_mode = on_no_stats
gp_autostats_on_change_threshold = 2147483647

CREATE TABLE my_table 
(state varchar(100),
 counter int);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics won't be updated.  The planner will think the table has only 1 row.

ANALYZE my_table;
--Statistics manually gathered and correctly shows the correct number of rows in the table.

Example 2 – Using on_change
gp_autostats_mode = on_change
gp_autostats_on_change_threshold = 1000000

CREATE TABLE my_table 
(state varchar(100),
 counter int);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will not be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.

Checking the Statistics
And here are a couple of queries you can use to see the statistics information.

SELECT c.oid, c.relpages, c.reltuples 
FROM pg_class c 
JOIN pg_namespace n ON c.relnamespace = n.oid 
WHERE n.nspname = 'public' 
AND c.relname = 'my_table';
--using the oid from the previous query
SELECT * FROM pg_statistic WHERE starelid = 298610;

Greenplum and HAWQ automatically gather statistics for you in most cases. If you are doing lots of DML activity, you can change the configuration to still automatically gather statistics for you too. These automatic settings make life easier for DBAs, Developers, and Analysts but still give you the flexibility to configure it in the best way for your environment.

Greenplum Database Incremental Backup

Greenplum Database has a feature called “incremental” backup but really it is an incremental differential backup. In this post, I will explain how it works and give you a working example.

AO Tables
In version 4.2 and earlier, AO tables are Append Only but starting in 4.3, the tables are Append Optimized. Everyone using 4.3 should be using AO tables as they use less space, support this incremental feature, and execute faster. Incremental backup requires AO tables.


CREATE TABLE foo (id integer, bar text) WITH (appendonly=true) DISTRIBUTED BY (id);

Using the incremental option on a regular heap table will not make any difference. Every backup will back up all heap tables.


First create a database.

gpdb=# drop database poc;
gpdb=# create database poc;
gpdb=# \c poc
You are now connected to database "poc" as user "gpadmin".

Now create an AO table and insert some data.

poc=# CREATE TABLE foo (id integer, bar text) WITH (appendonly=true) DISTRIBUTED BY (id);
poc=# INSERT INTO foo VALUES (1, 'bar_1');
poc=# INSERT INTO foo VALUES (2, 'bar_2');

I’m next going to execute a full backup. For the demo, I’m going to turn off compression so it is easier to view the backup files.

I’m using gpcrondump with these switches:
-x specifies the database to backup
-a means automatic and don’t prompt
-z turns off compression

bigmac:~ gpadmin$ gpcrondump -x poc -a -z
20140909:13:39:19:006156 gpcrondump:bigmac:gpadmin-[INFO]:-Starting gpcrondump with args: -x poc -a -z

Just to see what is in the backup file, I’m going to cat one of the segment’s backup files. It shows one of the two records in this tiny table.

bigmac:~ gpadmin$ cat /data/primary/gpseg0/db_dumps/20140909/gp_dump_0_2_20140909133919 
-- Greenplum Database database dump

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

SET search_path = public, pg_catalog;

SET default_with_oids = false;

-- Data for Name: foo; Type: TABLE DATA; Schema: public; Owner: gpadmin

COPY foo (id, bar) FROM stdin;
1	bar_1

-- Greenplum Database database dump complete

Now I immediately run an incremental backup. Nothing changed in my database so it should be an empty backup set.

bigmac:~ gpadmin$ gpcrondump -x poc -a -z --incremental
20140909:13:41:19:006395 gpcrondump:bigmac:gpadmin-[INFO]:-Starting gpcrondump with args: -x poc -a -z --incremental

To verify:

bigmac:~ gpadmin$ cat /data/primary/gpseg0/db_dumps/20140909/gp_dump_0_2_20140909134119 
-- Greenplum Database database dump

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

-- Greenplum Database database dump complete

So my incremental backup doesn’t have any data in it. If I had deleted or inserted any data to my one table, the entire table would be part of the backup set. That is why it is an incremental differential backup. If there is a difference in the AO table (or partition), the entire table/partition will be included in the incremental backup.
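The selection logic can be pictured with a toy sketch (my illustration, not gpcrondump's actual code): compare each AO table's state against the last backup and include the whole table if anything at all changed.

```python
# Toy sketch of incremental *differential* selection; not gpcrondump's code.
# A table (or partition) whose change marker differs from the last backup
# is dumped in full; unchanged tables are skipped entirely.
def tables_to_dump(current, last_backup):
    """current/last_backup map table name -> some change marker
    (e.g. a row-modification count)."""
    return sorted(t for t, marker in current.items()
                  if last_backup.get(t) != marker)

# foo is unchanged since the full backup, so the incremental is empty:
print(tables_to_dump({"foo": 2}, {"foo": 2}))   # []
# one new row changes foo's marker, so ALL of foo is dumped again:
print(tables_to_dump({"foo": 3}, {"foo": 2}))   # ['foo']
```

This is why a single inserted row pulls the entire table (or partition) back into the next incremental backup set.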

So now that I have one full backup and one incremental, let’s drop the table and then restore it.

bigmac:~ gpadmin$ psql poc
psql (8.2.15)
Type "help" for help.

poc=# DROP TABLE foo;
poc=# \q

And now the restore using gpdbrestore. I’m using the following switches:
-a means automatic and don’t prompt
-e will drop the target database and then recreate it
-s uses the latest backup set

bigmac:~ gpadmin$ gpdbrestore -a -e -s poc
20140909:13:42:58:006720 gpdbrestore:bigmac:gpadmin-[INFO]:-Starting gpdbrestore with args: -a -e -s poc

Verify the restore worked:

bigmac:~ gpadmin$ psql poc
psql (8.2.15)
Type "help" for help.

poc=# SELECT * FROM foo;
 id |  bar  
  1 | bar_1
  2 | bar_2
(2 rows)


Both commands used here, gpcrondump and gpdbrestore, have lots of options to backup and restore respectively. Using the incremental switch on gpcrondump will save you disk space and execution time while still giving you the ability to do a full system restore or even a selective table restore.

Hadoop Distributions

Today’s landscape of Hadoop vendors is mostly comprised of privately held companies that have big investments and partnerships with well established companies. This is very similar to the MPP marketplace in 2010 when these companies were purchased by much larger companies.

Here are a few of the MPP database vendors with their buyers:
Greenplum => EMC
Netezza => IBM
DATAllegro => Microsoft
Aster Data => Teradata
Vertica => HP

Similarly, I think the landscape of Hadoop vendors will change in the near future. Here are the major vendors in this Hadoop space as of September 2014:


Cloudera
  • Private
  • Investments: 2011 – $40M; 2014 – $900M
  • Around 600 employees
  • Founded in 2009
  • Partners with Oracle, Intel (funding), and Amazon (but also competes with Amazon)


Hortonworks
  • Private
  • Investments: 2011 – $23M + $25M
  • 201-500 employees
  • Founded in 2011
  • Partners with Yahoo, Teradata, and SAP


IBM
  • Public
  • $100B Revenue / year
  • 400K employees
  • Founded in 1911


MapR
  • Private
  • Investments: 2009 – $9M; 2014 – $110M
  • 201-500 employees
  • Founded in 2009
  • Partners with Google


Pivotal
  • Private
  • Investments: 2013 – $100M from GE and assets from EMC and VMWare
  • 3000+ employees
  • Founded in 2013 (Pivotal), 2003 (Greenplum), 1998 (VMWare) and 1979 (EMC)
  • Partners with EMC, VMWare, and GE


Amazon
  • Public
  • $75B Revenue / year
  • 132K employees
  • Founded in 1994

Hadoop Vendors Tomorrow
Cloudera => Oracle or Amazon
It will probably be Oracle because of the existing partnership and the leadership that came from Oracle, but Amazon may want it more. If Oracle doesn’t buy Cloudera, it will probably try to create its own distribution like it did with Linux.

Hortonworks => Teradata
It is only a matter of time before Teradata will have to buy Hortonworks. Microsoft might try to buy Hortonworks or just take a fork of the Windows version to rebrand. Microsoft worked with Sybase a long time ago with SQL Server and then took the code and ran rather than buying Sybase. So because of that history, I think Microsoft won’t buy and Teradata will.

Teradata bought Aster Data and Hortonworks would complete their data portfolio. Teradata for the EDW, Aster Data for Data Marts, and Hortonworks for their Data Lake.

MapR => Google
Google will snatch up MapR which will make MapR very happy.

So that leaves IBM and Amazon as the two publicly held companies left. Pivotal is privately held by EMC, VMWare, and GE, and past actions by EMC give every indication that this company will go public and be big.

Post Acquisitions
So after the big shakeup, I think you’ll see these vendors remaining selling Hadoop:

  • Pivotal: 100% Apache based with the best SQL Engine
  • IBM: Big Insights
  • Teradata: Hortonworks
  • Oracle: Cloudera
  • Google: MapR
  • Amazon: Elastic MapReduce

I could be wrong but I really do think there will be a consolidation of vendors in the near future.

Greenplum 4.3 External Tables to Hortonworks 2.0

The following instructions are valid for connecting Greenplum Database to Hortonworks HDP 2.0. Note that this technique is not officially supported by Pivotal, and Pivotal has added support for HDP 2.1 in Greenplum Database versions 4.3.3 and 4.2.8. See this post for more information.

Hortonworks (HDP) 2.0 is based on Apache Hadoop 2.2, which is also what Pivotal HD 2.0 is based on. If you haven’t read it already, I was able to demonstrate how Greenplum can read and write data to a Pivotal HD cluster in parallel. You can use this same basic configuration to integrate Greenplum with Hortonworks 2.0!

Follow the same steps in configuring Greenplum database for Pivotal HD 2.0. We’ll use the same client libraries to connect to HDP 2.0. Next create a user in HDP 2.0 named gpadmin. Lastly, create the same types of External Tables but use the IP address of the HDP cluster.

CREATE WRITABLE EXTERNAL TABLE ext_foo_hdp20
(i int, bar text) LOCATION ('gphdfs://') FORMAT 'text' (delimiter '|' null 'null');

INSERT INTO ext_foo_hdp20 SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;

CREATE EXTERNAL TABLE ext_get_foo_hdp20
(i int, bar text) LOCATION ('gphdfs://') FORMAT 'text' (delimiter '|' null 'null');

Here are the files Greenplum created in the Hortonworks cluster.

[gpadmin@sandbox ~]$ hadoop version
Subversion -r 8656b1cfad13b03b29e98cad042626205e7a1c86
Compiled by jenkins on 2013-10-18T00:19Z
Compiled with protoc 2.5.0
From source with checksum d23ee1d271c6ac5bd27de664146be2
This command was run using /usr/lib/hadoop/hadoop-common-
[gpadmin@sandbox ~]$ hdfs dfs -ls /user/gpadmin/foo_bar/
Found 2 items
-rw-r--r--   3 gpadmin gpadmin        490 2014-08-05 06:19 /user/gpadmin/foo_bar/0_1407266192-0000000091
-rw-r--r--   3 gpadmin gpadmin        494 2014-08-05 06:19 /user/gpadmin/foo_bar/1_1407266192-0000000091

Here is the data from HDP in Greenplum.

postgres=# SELECT * FROM ext_get_foo_hdp20 ORDER BY i LIMIT 10;
 i  |  bar   
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)


I suspect the integration will get better and better between Greenplum and the various Hadoop distributions because for the most part, all of the distributions rely on the Apache distribution of Hadoop. Look for more to come in the future too!

HAWQ with Parquet Files

Parquet is a format for column oriented data to be stored in HDFS. It is part of the Apache distribution and is also available in Pivotal HD and HAWQ. HAWQ can store and read data in the Parquet format and it is also available with the open source components of Pivotal HD such as Pig and MapReduce.

Here is a quick example showing how this works.

First, create a table with some data. You can either use the CTAS method or the more traditional CREATE TABLE and then INSERT. Either method works and it is up to your preference on which you do. This example generates only 100 records of some fake customer data.


CREATE TABLE customer
WITH (appendonly=true, orientation=parquet) AS
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i::text, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub;

Or, with CREATE TABLE and then INSERT:

CREATE TABLE customer
( id integer,
  fname text,
  lname text,
  address text,
  city text,
  state text,
  zip text
) WITH (appendonly=true, orientation=parquet);

INSERT INTO customer
SELECT i AS id, 'jon' || i AS fname, 'roberts' || i AS lname, i::text || ' main street'::text AS address, 'new york'::text AS city, 'ny'::text AS state, lpad(i::text, 5, '0') AS zip
FROM (SELECT generate_series(1, 100) AS i) AS sub;

Now you have data in the Parquet format in HAWQ. Pretty easy, huh?

Next, I’ll use a nifty tool that queries the HAWQ catalog which tells me where the Parquet files are.

gpextract -o customer.yaml -W customer -dgpadmin

And here is the customer.yaml file it created.

DBVersion: PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ build
  8119) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Apr
  23 2014 16:12:32
DFS_URL: hdfs://
Encoding: UTF8
FileFormat: Parquet
  Checksum: false
  CompressionLevel: 0
  CompressionType: null
  EnableDictionary: false
  Files:
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
  PageSize: 1048576
  RowGroupSize: 8388608
TableName: public.customer
Version: 1.0.0

Notice the path to the files which are in Hadoop and are in the Parquet format.
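If you'd rather script against that YAML than read it by hand, a quick helper (my own sketch, not part of any HAWQ tooling) can pull out just the file paths — here with a naive line scan so no YAML library is needed:

```python
# Naive scan for the "- path:" entries in the gpextract output;
# a real script would use a proper YAML parser instead.
def parquet_paths(yaml_text):
    return [line.split("path:", 1)[1].strip()
            for line in yaml_text.splitlines()
            if line.lstrip().startswith("- path:")]

sample = """\
FileFormat: Parquet
  - path: /hawq_data/gpseg0/16385/16554/16622.0
    size: 4493
  - path: /hawq_data/gpseg1/16385/16554/16622.0
    size: 4499
"""
print(parquet_paths(sample))
# ['/hawq_data/gpseg0/16385/16554/16622.0', '/hawq_data/gpseg1/16385/16554/16622.0']
```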

Now you can use a tool like Pig to look at the data.

grunt> A = load '/hawq_data/gpseg{0,1}/16385/16554/16622' USING parquet.pig.ParquetLoader();
grunt> describe A;                                                                          
A: {id: int,fname: bytearray,lname: bytearray,address: bytearray,city: bytearray,state: bytearray,zip: bytearray}
grunt> B = foreach A generate id, fname, lname, address, city, state, zip;
grunt> dump B;
(2,jon2,roberts2,2 main street,new york,ny,00002)
(4,jon4,roberts4,4 main street,new york,ny,00004)
(6,jon6,roberts6,6 main street,new york,ny,00006)
(8,jon8,roberts8,8 main street,new york,ny,00008)

Parquet is easy to use in HAWQ and doesn’t lock you into a Pivotal HD and HAWQ only solution. It is easy to use the other tools like Pig or MapReduce to read the Parquet files in your Hadoop cluster. No vendor lock-in.

HAWQ versus Greenplum Database

HAWQ is a port of Greenplum database to work with the HDFS file system from Hadoop. It is only available with the Pivotal HD distribution of Hadoop even though Pivotal HD is 100% Apache Hadoop compliant.

This post will focus on the major differences in the filesystem, External Tables, DDL commands, and DML commands. I’m comparing Greenplum database version 4.3 with the version of HAWQ that ships with Pivotal HD version 2.0.1.

HDFS versus Posix Filesystem
With Greenplum database, each segment’s files are always local to the host. Even in a failover condition when the mirror is acting as the primary, the data is local to the processing.

With HAWQ, HDFS handles high availability by having three copies of the data across multiple nodes. Because of this, the mirroring that is built into the Greenplum database is removed from HAWQ. If a node were to fail, then Hadoop automatically creates a third copy of the data. So a segment running on a Hadoop data node may not have the data it needs local to it and will need to get data from other physical nodes.

External Tables
Both HAWQ and Greenplum database have External Tables but differ when accessing external data in Hadoop.

HAWQ has PXF which is the Pivotal Extension Framework. It has the ability to access files in HDFS stored as plain text but also in Hive, Hbase, Avro, and Gemfire XD. You can write your own custom profiles to get data from HDFS. PXF can also get some statistics about these files so the optimizer is smarter when accessing these External Tables.

Greenplum database doesn’t have PXF but does have GPHDFS. GPHDFS enables Greenplum database to read and write data to HDFS. It doesn’t have built-in capabilities to Avro, Hive, HBase, and Gemfire XD. It also doesn’t have statistics for these External Tables.

HAWQ is great at exploring and transforming data in Hadoop while Greenplum database is great at bulk loading data from Hadoop into the database as well as bulk writing data from Greenplum database into Hadoop. So land all of your data in Hadoop, transform it with SQL and then create data marts in Greenplum database.

Functions
Both offer functions but HAWQ doesn’t have SECURITY DEFINER functions yet.

DDL Commands
Here is a list of commands that are in Greenplum database but not in HAWQ:

  • CREATE AGGREGATE: user defined aggregate like SUM and COUNT.
  • CREATE CAST: user defined conversion of two datatypes.
  • CREATE CONVERSION: user defined conversion of character set encodings.
  • CREATE DOMAIN: user defined datatype with optional constraints.
  • CREATE INDEX: indexes aren’t supported in HAWQ.
  • CREATE OPERATOR: user defined operator, such as making != behave the same as <>.
  • CREATE OPERATOR CLASS: user defined class of how a data type is used within an Index.
  • CREATE RULE: user defined filter placed on a table or view like “gender_code in (‘M’, ‘F’)”
  • CREATE TABLESPACE: user defined directory in the Posix filesystem used to store database objects.
  • CREATE TRIGGER: user defined trigger for a table. Note that this is very limited in Greenplum database.

DML Commands
HDFS is designed for “write once, read many” and cannot handle the file pruning required for DELETE and UPDATE commands. Because of this, HAWQ doesn’t support the UPDATE and DELETE commands while Greenplum database does.
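A toy sketch (my illustration, not how either product is implemented) of why append-only storage supports INSERT and SELECT but not UPDATE or DELETE — rows can only ever be added to the end of a file:

```python
# Toy append-only "table": appends are cheap, but changing or removing an
# existing row would mean rewriting the file, which HDFS does not allow.
class AppendOnlyTable:
    def __init__(self):
        self._rows = []

    def insert(self, row):
        # Supported: write once, appending to the end.
        self._rows.append(row)

    def scan(self):
        # Supported: read many times.
        return list(self._rows)

    def update(self, idx, row):
        # Not supported: would require pruning/rewriting the file.
        raise NotImplementedError("UPDATE requires rewriting the file")

    def delete(self, idx):
        # Not supported: would require pruning/rewriting the file.
        raise NotImplementedError("DELETE requires rewriting the file")
```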

There are other small differences between the two products but these are the major ones.

Pivotal HD and Greenplum Database Integration

So you are thinking about building a Hadoop cluster or already have one and wondering how you can get data to and from Hadoop into an MPP database like Greenplum Database (GPDB). With GPDB, this can be done in parallel which makes it the ideal solution for an Analytics, Reporting, or Data Warehousing environment.

In this example, I will generate some dummy data with a SQL statement in GPDB and load it into PHD with an INSERT statement. I will then read the data back out of PHD with a SELECT statement.

One time steps Configuration

1. Download Pivotal HD that matches your version.

2. Copy the file to the Master server in the GPDB cluster. In my example, gpdbvm43 is a single node VM with 2 segments.

scp PHD- root@gpdbvm43:/root

3. ssh as gpadmin to GPDB 4.3 instance

vi .bashrc
[add these entries]
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/gphd

Note: this needs to be repeated on every host in the cluster. You can use gpscp or gpssh to make these changes to all hosts too.

4. Make sure the database is running.

gpstart -a

5. Change the gp_hadoop_target_version to be compatible with Pivotal HD 1.0 and greater.

gpconfig -c gp_hadoop_target_version -v gphd-2.0

6. Add an entry to your /etc/hosts for all of the hosts in the Hadoop cluster. This needs to be done on all hosts in the GPDB cluster too. I’m using a single node VM of PHD so I just have one entry.

vi /etc/hosts 
[add every node in the PHD cluster] pivhdsne.localdomain pivhdsne

Note: You can then use gpscp to copy the revised hosts file to the other hosts in the cluster.

7. Install the PHD client in the GPDB cluster.

su - 
tar --no-same-owner -zxvf PHD-
cd PHD-
rpm -i *.rpm
cd ../../zookeeper/rpm
rpm -i *.rpm
cd ../../hadoop/rpm
yum install nc
rpm -i *.rpm

Note: You can use gpscp to copy the tar.gz file to the other hosts in the cluster and then use gpssh to execute these commands. Be sure to source the after connecting as root. “nc” may not be needed on your cluster but was required with my VM.

8. Now that you are gpadmin again, bounce the database.

gpstop -r

Pivotal HD Configuration
You likely already have this done but if you are using the single node VM of Pivotal HD, then you will need to edit your /etc/hosts file there so that Hadoop is accessible remotely.

1. ssh as root to the VM

[root@pivhdsne ~]# ifconfig
eth1      Link encap:Ethernet  HWaddr 00:0C:29:20:A3:8F  
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::20c:29ff:fe20:a38f/64 Scope:Link
          RX packets:9168 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1199 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:653659 (638.3 KiB)  TX bytes:199320 (194.6 KiB)

lo        Link encap:Local Loopback  
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3779367 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3779367 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4407394192 (4.1 GiB)  TX bytes:4407394192 (4.1 GiB)

Now vi /etc/hosts and change to the IP address of the VM.

localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
pivhdsne pivhdsne.localdomain

GPDB External Tables
1. Example of Writable External Table to PHD.

CREATE WRITABLE EXTERNAL TABLE ext_foo
(i int, bar text) LOCATION ('gphdfs://pivhdsne/foo_bar') FORMAT 'text' (delimiter '|' null 'null');

2. Insert some data into PHD from GPDB.

INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;

3. Create External Table to view the data in PHD.

CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION ('gphdfs://pivhdsne/foo_bar') FORMAT 'text' (delimiter '|' null 'null');

4. Select the data.

postgres=# SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
 i  |  bar   
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)

Here is how you can see that it was done in parallel. Notice there are two files in the /foo_bar directory that I specified in the External Writable Table above.

[pivhdsne:~]$ hdfs dfs -ls /foo_bar
Found 2 items
-rw-r--r--   3 gpadmin hadoop        490 2014-05-24 00:19 /foo_bar/0_1400790662-0000004541
-rw-r--r--   3 gpadmin hadoop        494 2014-05-24 00:19 /foo_bar/1_1400790662-0000004541

There are two files because my single node VM of GPDB has two Segments. Each Segment wrote its file to Hadoop at the same time. Completely parallel and scalable.

More proof!

[pivhdsne:~]$ hdfs dfs -cat /foo_bar/0_1400790662-0000004541 | more

[pivhdsne:~]$ hdfs dfs -cat /foo_bar/1_1400790662-0000004541 | more

Extra Credit!
Log into HAWQ on the PHD cluster and create an External Table to the same files!

CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION 
('pxf://pivhdsne:50070/foo_bar?profile=HdfsTextSimple') FORMAT 'text' (delimiter '|' null 'null');

gpadmin=# SELECT * FROM ext_get_foo ORDER BY i limit 10;
 i  |  bar   
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)

As you can see, there are lots of ways to move data between Greenplum database and Hadoop to satisfy a large variety of use cases to solve business problems.


I get spam posted to the comments section from time to time and this one is so funny that I had to share.

Traffic jam happened, after visiting Yosemite National Park is a well known as a type of government is also a few., (link to internet casino removed)

It looks like a bunch of English words thrown together to make a sentence and a link to a casino. Funny.