Category Archives: Greenplum

Greenplum on Cloud Utilities

Greenplum on AWS, GCP, and Azure includes many utilities that make it easier to run and manage Greenplum in the cloud.

gppower

gppower is a new utility that integrates with each cloud provider to automate the pause and resume of the cluster.

The pause command stops Greenplum, Command Center, and even PXF before suspending the segment virtual machines. This is designed to save you money because the IaaS providers do not charge for the compute time of stopped VMs.

The resume command reverses this by starting the VMs and then starting Greenplum, Command Center and PXF.
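
A minimal sketch of the workflow, assuming the subcommand names match the pause and resume commands described above (check the utility's help output for the exact syntax on your release):

gppower pause     # stops Greenplum, Command Center, and PXF, then stops the segment VMs
gppower resume    # starts the VMs, then starts Greenplum, Command Center, and PXF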

gpgrow

This utility also integrates with the IaaS and allows you to grow storage independently of compute. Each IaaS handles this feature a bit differently.

Growing the disk size is an online operation for AWS and GCP; the database doesn't even need to be stopped. For Azure, gpgrow will automatically stop and deallocate the VMs so that the disks can be modified. It will then start the VMs and Greenplum before expanding the disks.

gpmaintain & gpcronmaintain

This utility automates the common maintenance routines needed for Greenplum. The “cron” version makes it easier to schedule your maintenance. The configuration file is in /usr/local/greenplum-cloud/conf/.

The utility automates the execution of analyze and vacuum. It checks for bloated tables too.

gpoptional

This utility makes it easier to install optional components like Command Center. Just run it at the command line to see the list of optional components.

bouncer

This utility automates the start, stop, pause, and resume of pgBouncer. It is intended to make pgBouncer easier to use.
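
A hedged sketch of how the subcommands described above might be used (verify the exact syntax with the utility's help output):

bouncer stop      # stop pgBouncer before maintenance
bouncer start     # start pgBouncer again
bouncer pause     # temporarily pause connection pooling
bouncer resume    # resume connection pooling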

gpsnap & gpcronsnap

This has been a popular utility! It snapshots the data volumes in your cluster for extremely fast backups. The backups are executed in parallel and can be used for disaster recovery.

Instead of copying the database to backup files, gpsnap relies on the IaaS disk snapshots to create a consistent backup of your database.

The “cron” version of the utility makes it easier to schedule the execution of gpsnap. The configuration file is in /usr/local/greenplum-cloud/conf.

gprelease & gpcronrelease

This utility automates software upgrades. When an upgrade is available, the Message of the Day (the information displayed when you ssh to the master host) is updated to let you know a new version is available.

The utility not only upgrades the database but also upgrades dependent packages like MADlib and PostGIS.

Summary

Running Greenplum in the cloud keeps getting easier because of the utilities and automation that come with it. So be sure to check us out in your favorite cloud provider’s marketplace!

gpbackup and gprestore with S3

These tools replace the older gpcrondump and gpdbrestore utilities. A great feature of these newer tools is the ability to use S3 directly. This post will give a quick example of how to do this.

First, create a cluster and create a table to demonstrate the backup and restore.

psql -c "create table foo as select i from generate_series(1,1000000) as i distributed by (i);"

Next, create an s3_config.yaml file. You’ll need an Access key ID from AWS as well as the Secret Access key. This file should have the following:

executablepath: $GPHOME/bin/gpbackup_s3_plugin
options: 
  region: <your AWS region>
  aws_access_key_id: <your access key>
  aws_secret_access_key: <your secret access key>
  bucket: <S3 Bucket name>
  folder: <Path to store your backups in S3>

Next, you can execute a backup.

gpbackup --dbname gpadmin --plugin-config /home/gpadmin/s3_config.yaml 

Pretty simple, right?

To restore, you can follow these simple steps. (You will need to update the timestamp value for the backup you wish to restore.)

dropdb gpadmin
gprestore --plugin-config /home/gpadmin/s3_config.yaml --timestamp 20190823142028

Installing Greenplum with rpm

Pivotal Greenplum 5.x has two main installers. One is the older “binary” installer and the other is an rpm installer. With Greenplum 6.x, there will only be a single rpm installer. So, you need to get used to installing with rpm. This post will cover installing the Greenplum software on a cluster using rpm.

Step 1

First you need to install the binaries on your Master node. This is pretty simple. Copy the rpm file to your Master node, connect to the Master node as gpadmin, and then install the software.

scp -i private_key greenplum-db.rpm gpadmin@master_node:/home/gpadmin/
ssh -i private_key gpadmin@master_node
sudo rpm -i greenplum-db.rpm
sudo chown -R gpadmin:gpadmin /usr/local/greenplum-db*
echo "source /usr/local/greenplum-db/greenplum_path.sh" >> /home/gpadmin/.bashrc
source /usr/local/greenplum-db/greenplum_path.sh

Step 2

Note: This step assumes you have already exchanged keys across your cluster either manually or with gpssh-exkeys.

Copy the rpm file to all other nodes in the cluster and then run the rpm -i command again. You’ll need a file that lists all nodes in the cluster except the master.

for i in $(cat all_hosts_except_master.txt); do scp greenplum-db.rpm gpadmin@$i:/home/gpadmin/; done
gpssh -f all_hosts_except_master.txt "sudo rpm -i /home/gpadmin/greenplum-db.rpm"
gpssh -f all_hosts_except_master.txt "sudo chown -R gpadmin:gpadmin /usr/local/greenplum-db*"
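
As a quick sanity check, you can confirm the package is installed on every host. This sketch assumes the installed package is named greenplum-db; adjust the query if your rpm file registers a different package name.

gpssh -f all_hosts_except_master.txt "rpm -q greenplum-db"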

Done

There are lots of ways of achieving this with automation tools but this is one easy way to install the rpm on all nodes in your cluster.

Managing Resource with Resource Groups

You might have missed the addition of Resource Groups in Greenplum Database version 5, but it is something worth looking at. It is an alternative to Resource Queues for managing resources, and because of how it works, it makes the database faster!

Resource Queues prioritize and allocate resources, and they do this inside the database at the query and process level.

Resource Groups, on the other hand, rely on the OS kernel (Linux cgroups) to manage and prioritize resources. This alone removes the overhead of managing resources in the database, making it faster!

Resource Groups also give you finer-grained control over resources. So not only is it faster, it gives you greater control over the cluster.

Implementing Resource Groups is easy too.

gpconfig -c gp_resource_manager -v group
gpstop -a 
gpstart -a
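
Once Resource Groups are enabled, you create groups and assign roles to them. Below is a minimal sketch; the group name, limits, and role name (rg_etl, etl_user) are illustrative placeholders, not recommendations from this post.

psql <<'SQL'
-- Create a group with example limits and assign a role to it.
CREATE RESOURCE GROUP rg_etl WITH (CONCURRENCY=5, CPU_RATE_LIMIT=20, MEMORY_LIMIT=20);
ALTER ROLE etl_user RESOURCE GROUP rg_etl;
-- Review the configured groups.
SELECT * FROM gp_toolkit.gp_resgroup_config;
SQL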

More information on Resource Groups is in the Admin Guide: https://gpdb.docs.pivotal.io/5170/admin_guide/workload_mgmt_resgroups.html

Fun with dblink and Functions

Greenplum includes several optional modules from PostgreSQL. One of these is dblink which enables connecting to a remote PostgreSQL or Greenplum database with SQL.

First, run this command as gpadmin:

psql -f /usr/local/greenplum-db/share/postgresql/contrib/dblink.sql

Doc: https://gpdb.docs.pivotal.io/5180/utility_guide/dblink.html

If you are a superuser like gpadmin, you can use the connect function to connect to a database in the same cluster without specifying a password, like this:

PERFORM dblink_connect('blog_test', 'dbname=blog_db');

So now you can execute some SQL in the other database. It also runs in a separate transaction, which gives you the ability to have an autonomous transaction inside of a Function.

v_sql := 'CREATE TABLE foo (id int) DISTRIBUTED BY (id)';
PERFORM dblink_exec('blog_test', v_sql, true);
PERFORM dblink_disconnect('blog_test');

If you do all of this as a superuser, the function will execute the dblink successfully but if you grant EXECUTE on the function to a non-superuser, it will fail. Why? Because by default, the Function executes as the user that is executing the Function, not the owner.

The easy fix for this is to use SECURITY DEFINER when creating the function, which means the Function executes as the owner of the Function. You could have all sorts of code in your Function that requires a superuser and still grant EXECUTE to a non-superuser. This includes dblink!
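
Putting it together, here is a minimal sketch of a SECURITY DEFINER function wrapping the dblink calls shown above. The function and role names (fn_create_foo, app_user) are illustrative and not from the original example, and it assumes dblink has been installed as described earlier.

psql <<'SQL'
-- The function runs as its owner (a superuser), so dblink works for any caller.
CREATE OR REPLACE FUNCTION fn_create_foo() RETURNS void AS $$
BEGIN
    PERFORM dblink_connect('blog_test', 'dbname=blog_db');
    -- dblink_exec runs over its own connection, giving an autonomous transaction
    PERFORM dblink_exec('blog_test', 'CREATE TABLE foo (id int) DISTRIBUTED BY (id)', true);
    PERFORM dblink_disconnect('blog_test');
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

-- Now a non-superuser can be granted permission to call it.
GRANT EXECUTE ON FUNCTION fn_create_foo() TO app_user;
SQL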

Doc: https://gpdb.docs.pivotal.io/5180/ref_guide/sql_commands/CREATE_FUNCTION.html#topic1

Have fun!

Greenplum Summit

I will be presenting at the Greenplum Summit which is held in New York City March 18-22. I’m really looking forward to the event and speaking with customers. If you attend, be sure to look for me and let me know you use this site!

  • March 18-22 in New York City
  • Use code 2019_GREENPLUM to get a 25% discount off any pass type when you Register
  • Hands On Training Pre-Conference


Greenplum Master High-Availability in the Cloud

Master Node
The Greenplum architecture uses a Master node for client connections. The Master also generates and dispatches the query plan to the Segments and then returns results back to the clients. For the most part, the Master node is relatively idle because the data is distributed across multiple segments on other hosts.

Standby-Master
If the Master node fails, the database goes offline. To mitigate this risk and provide high availability, Greenplum has a Standby-Master process that runs on a separate host. Just like the Master, the Standby-Master is usually pretty idle. Sometimes it is also used as an ETL node so you aren’t wasting money by having an extra, idle machine.

In a physical data center, a node replacement may take days, so having a physical Standby-Master is typically a requirement. This isn’t a fault with Greenplum but a reality of replacing physical hardware.

Cloud
Deployments in the Cloud, however, can replace a bad node in minutes. So do you really need a VM just for the Standby-Master process? Furthermore, what happens when you automate the node replacement so that the cluster “Self-Heals”?

Cloud Marketplaces
In the AWS, Azure, and GCP Marketplaces, Pivotal Greenplum is deployed with a single Master node and the Standby-Master process runs on the first Segment host. If any node in the cluster fails, the cluster Self-Heals. The process is different in AWS versus GCP and Azure but the concept is the same. A new VM is created and the commands to recover the cluster back to 100% are executed automatically.

You get to save money by not having a dedicated Standby-Master node and you still get the high-availability you want with Greenplum.

gpsnap Video

gpsnap is a utility currently available for Greenplum in AWS and Azure. It automates backup and recovery with AWS EBS snapshots and Azure Disk snapshots. For AWS, you can also copy these snapshots to a different AWS Region which is an ideal solution for Disaster Recovery.

Check it out!

Monitoring Sessions

I recently found out that one of my assumptions in Greenplum was completely wrong! After all of these years, I didn’t understand how pg_stat_activity worked!

I always thought that each row in this system table equated to a single session so, if I wanted to get the number of sessions logged in, I could do this:

select count(*) from pg_stat_activity;

But, this is wrong! One row actually means one server process and not one session.

When a session submits a query, pg_stat_activity may report multiple rows for that single session as it executes the query. Why? A single query can spawn multiple backend processes and these backend processes can show up in pg_stat_activity.

Example – Query 22 from TPC-DS
More Information on TPC-DS

select i_product_name
,i_brand
,i_class
,i_category
,avg(inv_quantity_on_hand) qoh
from inventory
,date_dim
,item
where inv_date_sk=d_date_sk
and inv_item_sk=i_item_sk
and d_month_seq between 1212 and 1212 + 11
group by rollup(i_product_name
,i_brand
,i_class
,i_category)
order by qoh, i_product_name, i_brand, i_class, i_category
limit 100;

First, look at pg_stat_activity and see that I have two sessions open. One is executing the simple query from pg_stat_activity (sess_id=1013) and the other is idle (sess_id=1011).

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | <idle>
(2 rows)

Now, execute Query 22 from sess_id=1011.

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | select i_product_nam
   31137 |    1011 | select i_product_nam
(3 rows)

I now see that sess_id=1011 has started a second process to complete the query. It is still a single database session but pg_stat_activity now shows two rows for this session.

Even after the query is over, I can still see the idle process created for sess_id=1011.

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | <idle>
   31137 |    1011 | <idle>
(3 rows)

So how do you get the number of sessions from pg_stat_activity? With this query:

gpadmin=# select count(*) from (
gpadmin(#     select sess_id from pg_stat_activity group by sess_id
gpadmin(# ) as sub;
 count 
-------
     2
(1 row)

Default Storage Options

When you create a table in Greenplum, the default storage options make the table heap (appendonly=false) which also means the table is row oriented and not compressed. But, you can change the default with the GUC gp_default_storage_options.

Here are the defaults:

gpadmin=# show gp_default_storage_options;
                            gp_default_storage_options                            
----------------------------------------------------------------------------------
 appendonly=false,blocksize=32768,compresstype=none,checksum=true,orientation=row

And a quick test, shows how you can default to something else.

gpadmin=# set gp_default_storage_options='appendonly=true';
SET
gpadmin=# create table foo as select i from generate_series(1, 1000) as i distributed by (i);
SELECT 1000
gpadmin=# \d foo
Append-Only Table "public.foo"
 Column |  Type   | Modifiers 
--------+---------+-----------
 i      | integer | 
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: t
Distributed by: (i)

You can make the changes permanent with gpconfig.

gpconfig -c 'gp_default_storage_options' -v 'appendonly=true, orientation=column'

And you can make it limited to a single database.

ALTER DATABASE mytest SET gp_default_storage_options = 'appendonly=true, orientation=column';

It’s a Snap!

gpsnap and gpcronsnap are new utilities for Pivotal Greenplum on AWS. These utilities leverage Amazon EBS Snapshots to take very quick snapshot backups.

EBS Snapshot
Elastic Block Storage (EBS) is the preferred storage in Amazon AWS for many reasons like persistence, performance, and high availability. A great feature of EBS is the ability to take a “Snapshot”. A snapshot can be taken of a disk volume and to get a consistent backup, you need to stop the database, take a snapshot of every volume, and then restart the database.

EBS Restore
The restore process requires stopping the database, detaching and deleting the existing volumes, creating new volumes from the snapshots, attaching the volumes to the right nodes, mounting the new volumes to the right directories, and then starting the database. If you don’t delete the existing volumes and just create new ones, you will incur more storage cost in AWS and you may hit your disk quota. So take a snapshot before you execute a restore so you can revert back if needed.

gpsnap makes this entire process easy. It does all of the tasks needed to take a consistent snapshot of the database. It also automates the tasks for restoring from a snapshot.

The commands include:
– gpsnap list: lists snapshots
– gpsnap create: creates a new snapshot
– gpsnap delete : deletes a specific snapshot
– gpsnap delete all: deletes all snapshots
– gpsnap restore : restores a specific snapshot
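
A typical cycle might look like the sketch below; the snapshot identifier is a placeholder, so use the value reported by gpsnap list.

gpsnap create                  # take a consistent snapshot of every data volume
gpsnap list                    # note the identifier of the snapshot you want
gpsnap restore <snapshot_id>   # stop the database, rebuild volumes from the snapshot, restart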

gpcronsnap is the second utility and you can probably guess from the name that it is the scheduling portion of the utility. By default, a snapshot backup will be created weekly and retain up to four snapshots. This is all configurable in a configuration file too.

So you can see, it is now a snap for Pivotal Greenplum on AWS to take snapshot backups!

You can either Bring Your Own License (BYOL) or pay Hourly for Greenplum on AWS. Here are the links:
Pivotal Greenplum on AWS (BYOL)
Pivotal Greenplum on AWS (Hourly)

Pivotal Greenplum Version 5.x

Version 3.x
I’ve been using Greenplum since version 3.0 back in 2007. That version was actually the first commercially available version too. The versions were released about once every 6 to 12 months and got as high as 3.3.

Version 4.x
Once the version got to 4.3, the number just seemed to get stuck. 4.3.0 was a pretty big release which changed “Append Only” tables to be “Append Optimized”, which simply meant you could start updating and deleting tables stored in the append-only format. More enhancements kept coming to 4.3, but the version number never exceeded 4.3.

Labs Methodology
Major enhancements came to version 4.3.x and the innovation came at a faster pace but you may not have noticed this if you were just looking at the version number. Pivotal’s engineering team embraced the Pivotal Labs methodologies of pair programming and quick iterations. There was a huge transformation happening in this group.

Version 5.x
When 5.0 came out, it was a big deal. Greenplum had always been a fork of PostgreSQL 8.2, but with 5.0, Greenplum was rebased to 8.3. A lot of work went into this change, which also required migrating from 4.3.x to 5.0.

Now this next change I didn’t expect. During 4.3.x development, the version number would only have changed to 4.4 or 5.0 when the database required a migration to upgrade. With 5.x, the version numbers are coming fast and don’t require migrations to upgrade. It is just a simple binary swap to upgrade.

The result has been the release of 5.0 on 2017-09-14, 5.1 on 2017-10-20, and 5.2 on 2017-11-18. Do you see the pattern? Monthly point releases! All of these releases so far have been simple binary upgrades but have a ton of improvements each time.

Version 6.0?
Version 6 is in sight. Seriously. I would expect this to be the next rebase of Greenplum, to PostgreSQL 8.4, and it will require a database migration to upgrade. It is amazing how the Labs culture and open source software development have enhanced the development of Greenplum.

AWS Marketplace

I haven’t posted much here lately because I’ve been working on deploying GemFire and Greenplum in the AWS Marketplace. It has been interesting work so I thought I would make a post about it here.

Amazon Web Services (AWS)
Amazon has the most popular cloud in the US with many products available, which now include Pivotal Greenplum and Pivotal GemFire. Both products are available as Bring Your Own License (BYOL) or Billed Hourly. BYOL simply means you already own a license and you wish to apply this to a deployment in AWS.

If you are not familiar with AWS, there are many ways to configure resources which can be overwhelming to learn and optimize. I have done extensive testing in these product offerings to ensure stability, security, repeatability, and performance. With just a few clicks, you can have an on-demand cluster built that rivals a full rack of on-premise hardware.

Greenplum and GemFire on AWS
Greenplum Billed Hourly
Greenplum Bring Your Own License (BYOL)
GemFire Billed Hourly
GemFire BYOL

Dropping Corrupt Tables in Greenplum

Greenplum is a Massively Parallel Processing database, which means the data is distributed across multiple segments on multiple hosts. Each segment process stores a catalog of objects, and it is possible for this to get out of sync because of hardware failures or killed processes. For example, if you are vacuuming a table and then decide to kill the process on the Master host, the segments may not get the signal properly and roll back the transaction. This can happen when using “kill -9” rather than just “kill” to stop a process.

You may get this error message if a table is corrupt:

gpadmin=# select * from corrupt;
ERROR:  relation with OID 109802 does not exist  (seg0 slice1 gpdbsne:40000 pid=78502)

So you decide to drop the table and start over but that doesn’t work either.

gpadmin=# drop table corrupt;
ERROR:  table "corrupt" does not exist  (seg0 gpdbsne:40000 pid=78502)

The easy fix for this is to use “if exists” in the drop command.

gpadmin=# drop table if exists corrupt;
DROP TABLE
Time: 10.423 ms

Now the table is gone and you can recreate it. Next time, don’t use kill -9 and don’t kill vacuum commands.

More Details
If you are wondering how I managed to corrupt this table, here are the steps.
1. Create the table

gpadmin=# create table corrupt (id int, foo text, bar text) distributed by (id);
CREATE TABLE
Time: 31.195 ms

2. Insert some data

insert into corrupt select i as id, 'foo_' || i, 'bar_' || i from generate_series(1,100) as i;
INSERT 0 100
Time: 46.419 ms
gpadmin=# analyze corrupt;
ANALYZE
Time: 108.193 ms
gpadmin=# \q

3. Connect to a segment directly. I’m using a single node for this test.

[gpadmin@gpdbsne ~]$ ps -ef | grep postgres | grep M
gpadmin   78213      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db-4.3.13.0/bin/postgres -D /data/primary/gpseg1 -p 40001 -b 3 -z 2 --silent-mode=true -i -M mirrorless -C 1
gpadmin   78214      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db-4.3.13.0/bin/postgres -D /data/primary/gpseg0 -p 40000 -b 2 -z 2 --silent-mode=true -i -M mirrorless -C 0
gpadmin   78239      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db-4.3.13.0/bin/postgres -D /data/master/gpseg-1 -p 5432 -b 1 -z 2 --silent-mode=true -i -M master -C -1 -x 0 -E

I see that the segments are using ports 40,000 and 40,001 while the master is using 5432. I’ll connect to seg0 on port 40,000 and then drop the table. This only drops the object in this one segment. I’ll also do a quick query to make sure I only see half of the data because I have 2 segments.

PGOPTIONS='-c gp_session_role=utility' psql -p 40000
psql (8.2.15)
Type "help" for help.

gpadmin=# select count(*) from corrupt;
 count 
-------
    50
(1 row)

Time: 3.485 ms

4. Corrupt the database by dropping the table only on one segment.

drop table corrupt;
DROP TABLE
Time: 9.410 ms
gpadmin=# \q

5. Confirm that the table is corrupt.

[gpadmin@gpdbsne ~]$ psql
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from corrupt;
ERROR:  relation with OID 109802 does not exist  (seg0 slice1 gpdbsne:40000 pid=78502)

And again, the fix:

gpadmin=# drop table if exists corrupt;
DROP TABLE
Time: 10.423 ms

S3 External Tables

S3 is Amazon’s Simple Storage Service, an inexpensive cloud storage offering that has quickly become popular for cold data and backups. Greenplum now has External Tables that can read and write data to S3, so you can leverage this popular storage service with Greenplum. Here is how you do it!

Configuration
1. You will need an Amazon account with your Access Key ID and Secret Access Key. If you have the aws CLI installed and configured, just cat ~/.aws/credentials

2. Get your default region. This is in ~/.aws/config

3. You’ll need a bucket in this region and this can be done with the AWS web interface.

4. You’ll need a configuration file. Here is an example (be sure to change the secret and accessid).

[default]
secret = <secret>
accessid = <accesssid>
threadnum = 4
chunksize = 67108864
low_speed_limit = 10240
low_speed_time = 60
encryption = true

5. Copy the configuration to every Segment Host in your cluster.

for i in $(cat segment_hosts.txt); do scp s3_demo.conf $i:/home/gpadmin; done

Writing to S3

1. Create the Writable External Table with the S3 protocol, AWS URL that has the correct region, and the configuration file that is found on every Segment Host.

CREATE WRITABLE EXTERNAL TABLE public.demo_write
(id int, fname text, lname text)
LOCATION ('s3://s3-us-east-1.amazonaws.com/pivotalguru/demo config=/home/gpadmin/s3_demo.conf')
FORMAT 'TEXT' (DELIMITER '|' NULL AS '');

2. Execute an INSERT statement:

INSERT INTO demo_write 
SELECT i, 'Jon_' || i, 'Roberts_' || i
FROM generate_series(1,10000) as i;

Note: Each Segment will create a file in S3 in your bucket with the prefix you specify in the location specified in the Writable External Table. In this demo, each file is prefixed with “demo”. An example filename is “demo9767abbb3.txt”.

Reading from S3
1. Create the External Table with the same location and configuration as before.

CREATE EXTERNAL TABLE public.demo_read
(id int, fname text, lname text)
LOCATION ('s3://s3-us-east-1.amazonaws.com/pivotalguru/demo config=/home/gpadmin/s3_demo.conf')
FORMAT 'TEXT' (DELIMITER '|' NULL AS '');

2. Select the data.

gpadmin=# select * from demo_read limit 10;
 id  |  fname  |    lname    
-----+---------+-------------
  58 | Jon_58  | Roberts_58
  90 | Jon_90  | Roberts_90
 122 | Jon_122 | Roberts_122
 191 | Jon_191 | Roberts_191
 207 | Jon_207 | Roberts_207
 239 | Jon_239 | Roberts_239
 271 | Jon_271 | Roberts_271
 319 | Jon_319 | Roberts_319
 335 | Jon_335 | Roberts_335
 351 | Jon_351 | Roberts_351
(10 rows)

Time: 1139.538 ms

Tips
1. An S3 External Table that references a single file will only use a single Segment to read the data. Instead, try to have at least 1 file per Segment for an S3 External Table.
2. S3 External Tables supports gzip compression only.
3. Use S3 External Tables for cold storage or to create a backup of a table or query using a Writable External Table.

Greenplum Append Optimized Tables

Greenplum has two major types of table storage techniques. The first is inherited from PostgreSQL, so it uses Multi-Version Concurrency Control (MVCC) and is referred to as HEAP storage. The second is a more efficient technique called Append Optimized (AO). This blog post discusses the performance benefits of AO over HEAP tables.

HEAP
This is the default storage technique when you create a table in Greenplum. It handles UPDATE and DELETE commands just like PostgreSQL in terms of marking the old row invalid and inserting a new record into the table. This hides the deleted or stale row from new queries but provides read consistency to sessions that may have started querying the table prior to the UPDATE or DELETE.

Here is an example of a HEAP table.

CREATE TABLE heap_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
DISTRIBUTED BY (id);

But how does this HEAP table store data? How much space does it use?

INSERT INTO heap_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE heap_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'heap_table';

1114112

So it uses 1,114,112 bytes in my small single node test for this generated data of 10,000 records.

AO
AO storage was originally designed to be Append Only, but starting with Greenplum version 4.3, these tables became Append Optimized because the tables now allow UPDATE and DELETE like a HEAP table while retaining the efficient storage.

Create the same table and data but as an AO table.

CREATE TABLE ao_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
WITH (appendonly=true)
DISTRIBUTED BY (id);


INSERT INTO ao_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE ao_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'ao_table';

973016

This is now only using 973,016 bytes or a 12.6% reduction in disk space being used. For large tables, the performance difference will be more noticeable because it will take less time to scan your disks.

AO tables also let you compress them, which a HEAP table does not. What does that look like?

CREATE TABLE ao_compressed_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
WITH (appendonly=true, compresstype=quicklz)
DISTRIBUTED BY (id);


INSERT INTO ao_compressed_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE ao_compressed_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'ao_compressed_table';

234344

The size dropped again to 234,344 bytes, which is a huge decrease in disk space and another 75.9% reduction! This trades CPU cycles for less IO, which in most cases will provide better query times.

In the last test, make the table also column oriented. This is also a feature only available for AO tables.

CREATE TABLE ao_co_compressed_table
(id int,
 fname text,
 lname text,
 address1 text,
 address2 text,
 city text,
 state text,
 zip text)
WITH (appendonly=true, compresstype=quicklz, orientation=column)
DISTRIBUTED BY (id);


INSERT INTO ao_co_compressed_table (id, fname, lname, address1, address2, city, state, zip) 
SELECT i, 'Jon_' || i, 'Roberts_' || i, i || ' Main Street', 'Apartment ' || i, 'New York', 'NY', i::text
FROM generate_series(1, 10000) AS i;

ANALYZE ao_co_compressed_table;

SELECT sotdsize
FROM gp_toolkit.gp_size_of_table_disk
WHERE sotdschemaname = 'public'
AND sotdtablename = 'ao_co_compressed_table';

196840

It dropped again! This time to 196,840 bytes or another 16%.

The total percentage difference between a HEAP table and the AO table that is column oriented and compressed is 82.33%!

Summary
Use AO tables. These are faster than HEAP tables and also let you compress and column orient your larger tables for even better performance.

Outsourcer 5.2.0 and Pivotal HDB 2.0.1

Pivotal HDB 2.0.1 (based on Apache HAWQ)
The newest release of Pivotal HDB is fantastic! It adds new features and resolves some issues as well. Here are the release notes.

JDBC Bug
One resolved issue that affected Outsourcer is a JDBC problem documented here: HAWQ-738. Pivotal HDB 2.0.0 was released in May and I found that Outsourcer suffered from stability issues under load and was generally slower navigating in the UI than with HDB 1.3 or Greenplum Database. The issue was quickly resolved that same month but the fix wasn’t publicly available until October of 2016 with this new release.

Quicklz
Quicklz is a compression library that uses GPL licensing. It is bundled with Greenplum Database and Pivotal HDB 1.3 and 2.0.0. Starting with Pivotal HDB 2.0.1, Quicklz has been removed. Why? Because of that GPL license and HAWQ is an Apache project.

In HDB 1.3 and 2.0.0, the guidance was to use Quicklz compression for row oriented tables but starting with 2.0.1, you should use Snappy compression. It is an easy change too:

CREATE TABLE mytable (id INT, fname TEXT) 
WITH (APPENDONLY=TRUE, COMPRESSTYPE=SNAPPY) 
DISTRIBUTED RANDOMLY;

Note: When upgrading to 2.0.1, be sure to first change your Quicklz compressed tables to either Zlib, Parquet with Snappy, or no compression at all. Then upgrade and then you can change back to row orientation and use Snappy compression. More details are in this section of the release notes.

Future Releases
Future releases of Pivotal HDB should come quicker because the licensing hurdle is now complete. This means developers can focus on enhancements and fixes rather than licensing.

Outsourcer 5.2.0
This new release officially supports Pivotal HDB 2.0.1. It changes the compression for row oriented tables to use Snappy instead of Quicklz. If you are using Pivotal HDB 2.0.0 or Greenplum Database, Outsourcer will still use Quicklz compression.

Please consider upgrading from Pivotal HDB 2.0.0 to 2.0.1, especially if you are using Outsourcer. When I test Outsourcer with Pivotal HDB 2.0, I use build 22425 rather than build 22126, which is what you can download from Pivotal’s site. 22126 has the JDBC bug while 22425 and newer builds do not. And when you upgrade to 2.0.1, also upgrade to Outsourcer 5.2.0.

Download 5.2.0!
Documentation
Source Code

New Project: gpresize

I was working with a customer recently that wanted to try more segments per host than what was originally configured in their cluster, without adding more hosts. This is possible with gpexpand but only if mirroring has not been enabled. Disabling mirroring, however, is not a supported feature of Greenplum, but it is possible. So, to facilitate this analysis, I created “gpresize”.

Steps
– Removes mirrors (an unsupported operation in Greenplum)
– Backs up the database with gpcrondump
– Expands the cluster using gpexpand
– Shrinks the database by reinitializing it and restoring from backup
– Adds mirroring back using gpaddmirrors and gpinitstandby

Please, don’t use this in a Production cluster. I’m using unsupported commands to do this work and it is intended for evaluation purposes only. It is intended to help you easily increase the number of segments per host and then revert back if needed.

Github: https://github.com/pivotalguru/gpresize

Getting the filename with gpfdist

Occasionally, I see the request to get the filename added to a file read by gpfdist. Here is a way to do it!

First, create a YML file named “transform_config.yml” with the following:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transformation_input:
     TYPE: input 
     CONTENT: data
     COMMAND: /bin/bash transform.sh

Next, create that “transform.sh” file. This is just a simple example that gets all txt files but you can also pass in a parameter to this script. The filename in the external table gets passed to the script.

#!/bin/bash
set -e
for i in $(ls *.txt); do awk '{print FILENAME"|"$0}' $i; done

Create two test files (test_file_1.txt and test_file_2.txt).

cat test_file_1.txt 
1|foo1
2|foo2
3|foo3
4|foo4
5|foo5
6|foo6
7|foo7
8|foo8
9|foo9
10|foo10
cat test_file_2.txt 
11|foo11
12|foo12
13|foo13
14|foo14
15|foo15
16|foo16
17|foo17
18|foo18
19|foo19
20|foo20

Start gpfdist in the background.

gpfdist -p 8999 -c transform_config.yml > mylog 2>&1 < mylog &

Create the External Table but be sure to change the hostname. Note that "foo" in the LOCATION is the filename. I'm ignoring it for this example but this is how you can pass parameters to the script. You add %filename% to the YML file as the parameter to the script.

CREATE EXTERNAL TABLE ext_transform_table 
(filename text, id int, descrption text) 
LOCATION ('gpfdist://gpdbsne:8999/foo#transform=transformation_input') 
FORMAT 'text' (DELIMITER '|')

Now select from the External Table.

select * from ext_transform_table;
    filename     | id | descrption 
-----------------+----+------------
 test_file_1.txt |  1 | foo1
 test_file_1.txt |  2 | foo2
 test_file_1.txt |  3 | foo3
 test_file_1.txt |  4 | foo4
 test_file_1.txt |  5 | foo5
 test_file_1.txt |  6 | foo6
 test_file_1.txt |  7 | foo7
 test_file_1.txt |  8 | foo8
 test_file_1.txt |  9 | foo9
 test_file_1.txt | 10 | foo10
 test_file_2.txt | 11 | foo11
 test_file_2.txt | 12 | foo12
 test_file_2.txt | 13 | foo13
 test_file_2.txt | 14 | foo14
 test_file_2.txt | 15 | foo15
 test_file_2.txt | 16 | foo16
 test_file_2.txt | 17 | foo17
 test_file_2.txt | 18 | foo18
 test_file_2.txt | 19 | foo19
 test_file_2.txt | 20 | foo20
(20 rows)

HAWQ 2.0 Generally Available and Outsourcer 5.1.4

HAWQ, or commercially known as Pivotal HDB, just had a major release that I’m really excited about.

Major Features
– Based on Apache HAWQ and also includes support for Quicklz table compression plus support for PL/R, PL/Java, and pgCrypto
– Elastic runtime which means more segments (resources) can be allocated automatically based on the complexity of the query
– YARN integration
– Dynamic sizing of the cluster
– Block level storage which enables maximum parallelism
– Single HDFS directory per table which makes it easier to share and manage data
– Fault tolerance enhancements make it easier and quicker to add or remove data nodes
– HDFS catalog caching
– HCatalog integration which greatly simplifies accessing Hive data
– New management interface with “hawq” commands
– Support for Ambari 2.2.2
– Plugin support for Kerberos
– Better logging for runaway query termination

Product Page
Documentation
Download

Outsourcer 5.1.4
I also have updated Outsourcer to take advantage of HAWQ 2.0/Pivotal HDB 2.0. In HAWQ 2.0/Pivotal HDB 2.0, tables should be distributed randomly in order to take advantage of many of the new features. Starting with version 5.1.4, Outsourcer will now make all tables distributed randomly when the database is HAWQ 2.0/Pivotal HDB 2.0. For Greenplum and HAWQ 1.3, the tables will still be distributed by the source’s primary key if one is found.

Documentation
Download 5.1.4
Source Code

Outsourcer 5.1.1

5.1.1 enhances Append jobs to use Big Integer in addition to Integer data types. Additionally, you can now use Timestamp data types.

Be sure to always use an ordered sequence in Oracle and an ordered identity in SQL Server when using an Append job. Timestamp is useful when you are using the system timestamp in Oracle or SQL Server to append new data.

Download
Documentation
Source Code

HAWQ Demystified

The best SQL engine for Hadoop was not an overnight, “me too” product. It wasn’t too long ago that the Hadoop vendors had all but written off SQL, but the demand just hasn’t gone away.

HAWQ is the result of many, many years of work leveraging open source software as well as contributing back to the open source community. I think a short history lesson is in order to fully understand how this product came to be.

greenplum
Greenplum Database began life in 2003. The founders used open source PostgreSQL and released a commercial product soon after.

bizgres
Bizgres, an open source version of Greenplum, was released in 2005. Very early on, the founders of Greenplum embraced contributing back to the open source community.


MADlib
MADlib was released in 2010 as an open source project which later became an Apache Incubator project.

EMC
Greenplum was acquired by EMC in 2010 and almost immediately, EMC invested heavily into Hadoop. The Greenplum division was agile like a small startup company but with the deep pockets of a multi-billion dollar company.

GPHD
Greenplum released a Hadoop distribution in 2011 and integration between Greenplum Database and HDFS got more robust with the introduction of “gphdfs”. Greenplum supported External Tables to read/write data in parallel to HDFS from several different distributions.

HAWQ
HAWQ, a fork of Greenplum Database, was released in 2013. HAWQ was immediately extremely performant and compliant with the newest SQL syntax. HAWQ borrowed from the 10 years of experience developing Greenplum to provide a robust optimizer designed for HDFS.

Pivotal
2013 also saw Pivotal become a company. EMC contributed Greenplum Database, VMware contributed GemFire and Cloud Foundry, and GE contributed capital as an active partner. Paul Maritz became the CEO and the dedication to fully embrace open source became an integral part of the corporate culture.

During the last three years, HAWQ has become an Apache Incubator project. The Pivotal product now has the rather boring name of “Pivotal HDB” while HAWQ is the name of the Apache project.

Pivotal also made Greenplum and Geode (GemFire is the commercial product name) open source projects too. Clearly, Pivotal has embraced open source, with probably more committers than those other “open source” data companies.

So what now? What is happening in 2016? Well, Pivotal is about to release Pivotal HDB (HAWQ) 2.0. I’ve been testing this product for months on various platforms and I keep getting amazed by the performance and ease of use.

HAWQ 2.0 embraces Hadoop fully. I believe the two biggest features are elasticity and performance. HAWQ now supports elasticity for growing or shrinking clusters without having to redistribute the data. The performance is also much improved as it better utilizes HDFS and YARN.

Pivotal HDB is certified to run on Hortonworks HDP with plans on becoming a first class citizen of the Open Data Platform (ODPi).

So you may be asking, “is it fast?” and the answer is yes! I haven’t found a SQL engine that is faster and I’ve been doing competitive analysis for months. The other question you may ask is, “can I run my SQL?” and the answer is yes! A major competitor in the SQL on Hadoop landscape requires more tweaking of SQL just to get the SQL to execute and more tuning to get decent performance.

That “other SQL on Hadoop” product can’t do the following things (as well as many more) that HAWQ can.

– Can’t handle a comment line at the end of SQL file (I found that rather strange)
– Won’t do partition elimination through a joined table (e.g. date dimension)
– Can’t get number of milliseconds when subtracting from a date. Only gets seconds.
– Has “interval” in SQL dialect but doesn’t have interval as a type.
– No time data type.
– Concatenating strings doesn’t use || or +. You must use concat() function.
– Doesn’t support intersect or except.
– Doesn’t support subqueries that return more than one row.
– Doesn’t support correlated subqueries.
– Doesn’t support group by rollup.
– Doesn’t support subqueries in having statement.
– Subqueries not supported in select list.

HAWQ is the real deal. It is an Apache project, going to be part of ODPi, faster than everyone else, integrated with Apache Ambari, certified with Hortonworks, and the most mature SQL engine for Hadoop.

Analyzedb Performance Metrics

I have already written about analyzedb in this post, but I thought I would write another post about it with performance metrics.

The old method that I used to analyze tables was to analyze every table and partition sequentially. Then I would analyze the root partition of partitioned tables. The script would look like this:

Tables and Partitions

psql -t -A -c "SELECT 'ANALYZE ' || n.nspname || '.' || c.relname || ';' FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid WHERE n.nspname = 'tpcds' AND c.relname NOT IN (SELECT DISTINCT tablename FROM pg_partitions p WHERE schemaname = 'tpcds') ORDER BY 1" | psql -e

Root Partitions of Partitioned Tables

psql -t -A -c "SELECT 'ANALYZE ROOTPARTITION ' || n.nspname || '.' || c.relname || ';' FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid WHERE n.nspname = 'tpcds' AND c.relname IN (SELECT DISTINCT tablename FROM pg_partitions p WHERE schemaname = 'tpcds') ORDER BY 1" | psql -e

The second method is with analyzedb which can be done with a single line.

analyzedb -d gpadmin -s tpcds --full -a

The default for analyzedb is to use 5 threads, so you can adjust this to maximize performance. You can also take advantage of how analyzedb keeps track of the tables it has analyzed so it won’t unnecessarily analyze tables, which makes the process even faster.
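
For example, here is a sketch of raising the parallel level with the -p flag (verify the flag and its maximum with the utility's help output for your version):

analyzedb -d gpadmin -s tpcds --full -p 10 -a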

Here are some numbers to put this into perspective. I’m using virtual machines with only 1GB of data, but the percentage improvement is what we want to measure.

HAWQ 2.0 Beta
Sequential: 18 minutes and 37 seconds
Analyzedb: 8 minutes and 3 seconds
Improvement: 131% faster!

Greenplum Database 4.3
Sequential: 11 minutes and 25 seconds
Analyzedb: 6 minutes and 59 seconds
Improvement: 63% faster!

If you aren’t using analyzedb to maintain your HAWQ and/or Greenplum databases, start using it now! You’ll see much better performance in keeping your tables’ statistics up to date.

Outsourcer 5.0.9

5.0.9 adds support for HAWQ 2.0.

I’m looking for feedback on how best to handle table distribution for tables created in HAWQ 2.0. Outsourcer automatically sets distribution keys based on the source primary keys so it always uses hash when there is a PK found. HAWQ 2.0 supports hash and random distribution as before but random distribution allows a cluster to be resized without having to redistribute the data.

– Should I keep the code as-is?
– Should I set a global environment variable to allow you to set all tables to be created random or not?
– Should I update nearly every UI screen as well as the job and queue tables to have a random boolean that is only used for HAWQ 2.0?

Download
Documentation
Source Code

Greenplum Single Node Installation

Step 1
Download a CentOS 6 VM from http://virtual-machine.org/.

Step 2
Download the latest Greenplum binaries for RedHat Enterprise Linux 6 from http://network.pivotal.io
GPDB Download

Step 3
Start the Virtual Machine with VMWare Fusion or something similar.
Memory: 8GB
Cores: 4
Disk: 50GB
You can use less memory and cores but the more you provide the VM, the better it will perform. You might have to expand the VM disk space when using the VirtualMachine.org VM.

Step 4
Configure the operating system.

hostname gpdbsne

Add the hostname to /etc/sysconfig/network too.

Turn off firewalls.

chkconfig iptables off
service iptables stop
echo 0 >/selinux/enforce
vi /etc/selinux/config
SELINUX=disabled
setenforce 0

Edit the /etc/hosts file.

echo "127.0.0.1 gpdbsne gpdbsne.localdomain" >> /etc/hosts

I also like to get the ip address for this host and add it to my local /etc/hosts file.

ifconfig

Install unzip, ed, ntp and ssh.

yum install ntp
yum install unzip
yum install openssh-clients
yum install ed
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start

Add the following to the end of your /etc/sysctl.conf file.

kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1025 65535
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.overcommit_memory = 0

Add this to your /etc/security/limits.conf file.

* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

Remove all lines from /etc/security/limits.d/90-nproc.conf

echo "" > /etc/security/limits.d/90-nproc.conf

There are some other configuration changes you would make on a real cluster involving the XFS filesystem, but for a SNE this can be skipped. This is just intended for Development and Testing purposes.

Restart the VM so these changes take effect.

shutdown -r now

Step 5
Copy the installer to the VM.

scp greenplum-db-4.3.6.2-build-1-RHEL5-x86_64.zip root@gpdbsne:/root/

Step 6
ssh to the VM and run the installer.

ssh root@gpdbsne
unzip greenplum-db-4.3.6.2-build-1-RHEL5-x86_64.zip
/bin/bash greenplum-db-4.3.6.2-build-1-RHEL5-x86_64.bin

--Accept the license agreement
--Accept default installation directory

Step 7
For a multi-node cluster, the next step is to use gpseginstall but this isn’t needed with a single node installation. Instead, you have to manually create the gpadmin account and get the cluster ready for the next step.

useradd gpadmin
passwd gpadmin
chown -R gpadmin:gpadmin /usr/local/greenplum-db-4.3.6.2/
mkdir -p /data/master
mkdir /data/primary
chown -R gpadmin:gpadmin /data
su - gpadmin
echo "source /usr/local/greenplum-db/greenplum_path.sh" >> .bashrc
echo "export MASTER_DATA_DIRECTORY=/data/master/gpseg-1" >> .bashrc
source .bashrc
echo "gpdbsne" > hostfile
gpssh-exkeys -f hostfile

Step 8
Create an initialization file called gp_init_config so you can initialize the database.

ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=./hostfile
SEG_PREFIX=gpseg
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/data/primary /data/primary )
MASTER_HOSTNAME=gpdbsne
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE

Step 9
Initialize the database.

gpinitsystem -c ~/gp_init_config
--select Y to continue

Now create the default database and configure the database to allow external connections.

psql -c "create database gpadmin" template1
psql -c "alter user gpadmin password 'changeme'"
echo "host all all 0.0.0.0/0 md5" >> /data/master/gpseg-1/pg_hba.conf
gpstop -u

Complete!
Now you can connect to Greenplum with the gpadmin account and the password is changeme. The default port is 5432. A great client tool is pgAdmin III v1.14.3.
http://www.postgresql.org/ftp/pgadmin3/release/v1.14.3/

Updated gplink

I’ve made some enhancements to gplink.

1. Automatically starts gpfdist when creating External Tables
2. New command line programs: gpltable (create table), gpldrop (drop table), and gplstart (start any missing gpfdist processes)
3. Removal of the need for the Greenplum or HAWQ client
4. No additional database objects are needed
5. Drop External Table handled from gplink also stops the gpfdist process it had started
6. Most code moved to Java with some basic shell scripts to automate calling the Java methods

Documentation and Source Code
gplink Github Repository

Installer
Download

TPC-DS Benchmark

TPC
The TPC is a non-profit organization that provides several benchmarks for databases. The two common benchmarks they provide for Data Warehousing and Analytics are the Decision Support (TPC-DS) and Ad-Hoc (TPC-H) benchmarks. More information can be found here: http://www.tpc.org/.

SQL on Hadoop
– HAWQ can execute all 99 queries without modification.
– IBM Big SQL can execute all 99 queries but requires modifying 12 queries (as of the most recent publication I can find).
– Impala can’t run all of the queries and many require modifications to execute.
– Hive can’t run all of the queries and many require modifications to execute.

There isn’t much point in comparing HAWQ with these other SQL engines when none of them can execute all of the queries without modification! So… this post will focus on the capabilities of HAWQ as well as the automated TPC-DS benchmark process I have put together.

Sandbox
The tests will use a Hortonworks 2.2 Sandbox with 8GB of RAM and 4 cores dedicated to the VM. The dataset will be 2GB in size which works well in a VM. This is definitely not a huge test but it will demonstrate the capabilities of HAWQ. Again, all 99 queries can run in HAWQ without any modification!

TPC-DS
I’m using the tests provided here: https://github.com/pivotalguru/TPC-DS which I put together to help evaluate different hardware platforms. This script works with HAWQ and with Greenplum database.

This script automates the entire process of generating the data, building the tables, loading the tables, and executing the TPC-DS queries.

Results
1. Compile TPC-DS: 3.4 seconds
2. Generate the Data using dsdgen: 3 minutes 23 seconds
3. Create tables and ensure correct optimizer settings are enabled: 5.5 seconds
4. Load the tables: 2 minutes and 5 seconds
5. Execute all 99 TPC-DS queries: 6 minutes 46 seconds

To put this into perspective, I attempted to run a 2GB TPC-DS benchmark in the same VM with another SQL on Hadoop tool and it took 2 hours to just get all of the data loaded! The engine was unable to execute the vendor provided SQL queries either so I gave up on that effort. HAWQ is the way to go!

Greenplum Database 4.3.6.0

Download
Documentation

There are lots of new features in this maintenance release of Greenplum database.

1. External Tables can now read Parquet and Avro files in HDFS.
2. Partitioned Tables can now have External Partitions. This is useful for having a Greenplum table with some of its data residing outside the database, in something like Hadoop.
3. More enhancements to the Query Optimizer.
4. Native connection pooling.
5. COPY commands can be controlled by Resource Queues.
6. Utility analyzedb can now analyze catalog tables.
7. Utility gptransfer has enhancements.
8. Improvements in resynchronization of segment mirrors.
9. Utilities gpmigrator and gpmigrator_mirror have enhancements.
10. Improvements in gpsupport utility.

ANALYZE and analyzedb

Note: This post is for Greenplum only and I will make a subsequent post for HAWQ.

What Is Analyze?
Analyze is a pretty simple command that gets statistics on tables so the cost based optimizer can make the best plan possible. Greenplum has a mechanism to do this automatically for you but in some circumstances it won’t so you will have to run it manually.

gp_autostats_mode=on_no_stats
This is the default, which simply means the database will automatically gather stats when data is inserted into the table for the first time. When you do a Create Table As Select (CTAS), it will gather the statistics for you. If you create a table and then INSERT data, it will gather stats for you. But, if you INSERT data to a table that already has stats, Greenplum will not gather stats for you again.

Example

CREATE TABLE foo (id int, fname text) DISTRIBUTED BY (id);
INSERT INTO foo VALUES (1, 'jon');
INSERT INTO foo select i, 'jon_' || i from generate_series(1,1000000) as i;

In the above example, the first INSERT statement will cause the database to gather the stats and it will record that foo only has 1 row in it. The next INSERT of 1 million records won’t trigger another ANALYZE. Do you see where you could get into trouble? You need to run ANALYZE on foo.

ANALYZE foo;

gp_autostats_mode=on_change
This sounds like a good idea. If you insert more data into a table, it will automatically gather stats for you. But, there is another GUC that is important here. gp_autostats_on_change_threshold determines how many rows need to be added to a table before an ANALYZE is run automatically again. The default value is 2,147,483,647!

So in our first example of adding 1 million rows to an existing table, ANALYZE will still not run automatically; with the default threshold, on_change behaves as if gp_autostats_mode were set to none.

If you want to use on_change, I suggest reducing the threshold to a lower number.
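
For example, here is a sketch of switching to on_change with a much lower threshold; the threshold value is purely illustrative, so pick one that matches your typical load sizes.

gpconfig -c gp_autostats_mode -v on_change
gpconfig -c gp_autostats_on_change_threshold -v 1000000
gpstop -u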

gp_autostats_mode=none
This is the last value you can set for this GUC and it does exactly what is says. Nothing.

Partitioned Tables
If you insert data into an empty partitioned table, an ANALYZE will never automatically be executed! You will have to execute ANALYZE manually.

This is really important. Inserting data into partitioned tables will not cause an ANALYZE to be executed. The stats will be incorrect and you will need to run ANALYZE on the table.

Root Partition
The new Query Optimizer, code named Orca, must have statistics on the “rootpartition”. This simply is a rollup of all of the statistics from each partition at the table level.

When enabling Orca, you will need to add this, which is documented in this blog post:

gpconfig -c optimizer_analyze_root_partition -v on --masteronly

This will tell Greenplum to gather the table level statistics on a table when you ANALYZE it. If you don’t do this, the table will appear to be empty to Orca and it will most certainly have a poor query plan.

You can also run ANALYZE on just the rootpartition.

ANALYZE ROOTPARTITION foo;

Please Make this Easy! analyzedb
We did! There is a new utility called “analyzedb” which tracks what tables and partitions need to be analyzed and does it for you automatically. It also does the work in parallel so it completes the task much, much faster.

You can specify an entire database:

analyzedb -d gpdb

single schema:

analyzedb -d gpdb -s myschema

or a single table.

analyzedb -d gpdb -t public.foo

analyzedb is the program to add to your maintenance scripts to ensure you have the correct statistics in the database. It is another example of the continual development and enhancement of Greenplum database.

Enabling The New Greenplum Query Optimizer

Starting with Greenplum Database 4.3.5.0, there is a new Query Optimizer that is now available that is faster than the legacy optimizer.

History
Greenplum Database is a fork of PostgreSQL 8.2. This fork was done to optimize SQL queries for a distributed cluster, and it has done very well over the years. Oftentimes Greenplum is faster than competing MPP databases because of this Query Optimizer. But, because it has many years of development on top of PostgreSQL, it has become difficult to manage and enhance.

Orca was the code name of the new Query Optimizer, which is a refactor of the legacy Query Optimizer and also includes enhancements for performance and maintainability. This post is about how to enable the new Query Optimizer.

Enabling
Note: You must be using Greenplum Database 4.3.5.0 or newer.

1. Log into the Master server with ssh as gpadmin.
2. Make sure the database is running and healthy.

gpstate

3. Execute gpconfig:

gpconfig -c optimizer_analyze_root_partition -v on --masteronly

4. Enable the new optimizer for the entire system.

gpconfig -c optimizer -v on --masteronly

5. Apply the changes with gpstop.

gpstop -u

6. Disable the new Query Optimizer for the gpperfmon database.

ALTER DATABASE gpperfmon SET OPTIMIZER = OFF;

7. Verify the new Query Optimizer is on.

[gpadmin@mdw ~]$ psql
psql (8.2.15)
Type "help" for help.

gpdb=# explain select version();
                   QUERY PLAN                   
------------------------------------------------
 Result  (cost=0.00..0.00 rows=1 width=8)
   ->  Result  (cost=0.00..0.00 rows=1 width=1)
 Settings:  optimizer=on
(3 rows)

8. Using psql, change to the gpperfmon database and make sure the new Query Optimizer is off.

gpdb=# \c gpperfmon
You are now connected to database "gpperfmon" as user "gpadmin".
gpperfmon=# explain select version();
                QUERY PLAN                
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
 Settings:  optimizer=off
(2 rows)

gpperfmon=# 
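
If you would rather experiment before enabling it system-wide, the optimizer GUC can also be toggled for a single session; a quick sketch:

SET optimizer = on;
EXPLAIN SELECT version();
--the Settings line in the plan should show optimizer=on
SET optimizer = off;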

XML Loading Again

I covered this topic here before with a solution that handles very large XML files in Hadoop. This blog post covers parsing XML as it is loaded in Hawq or Greenplum database.

sample.xml

<?xml version="1.0"?>
<catalog>
      <large-product>
         <name>foo1</name>
         <price>110</price>
      </large-product>
      <large-product>
         <name>foo2</name>
         <price>120</price>
      </large-product>
      <large-product>
         <name>foo3</name>
         <price>130</price>
      </large-product>
      <large-product>
         <name>foo4</name>
         <price>140</price>
      </large-product>
      <large-product>
         <name>foo5</name>
         <price>150</price>
      </large-product>
      <small-product>
         <name>bar1</name>
         <price>10</price>
      </small-product>
      <small-product>
         <name>bar2</name>
         <price>20</price>
      </small-product>
      <small-product>
         <name>bar3</name>
         <price>30</price>
      </small-product>      
      <small-product>
         <name>bar4</name>
         <price>40</price>
      </small-product>
      <small-product>
         <name>bar5</name>
         <price>50</price>
      </small-product>
</catalog>

sample.xsl

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="catalog"><xsl:for-each select="large-product">
Large|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each>
<xsl:for-each select="small-product">
Small|<xsl:value-of select="name"/>|<xsl:value-of select="price"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

sample.yml

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  sample:
     TYPE: input
     CONTENT: data
     COMMAND: /usr/bin/xsltproc sample.xsl %filename%
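
Before wiring this into gpfdist, you can sanity check the transform locally; running xsltproc by hand should print the same pipe-delimited rows the external table expects:

xsltproc sample.xsl sample.xml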

Start gpfdist:

gpfdist -c sample.yml -p 8080 >> sample.log 2>&1 < sample.log &

Create External Table

create external table sample
(product_type text, product_name text, product_price int)
location ('gpfdist://bigmac:8080/sample.xml#transform=sample')
format 'text' (delimiter '|' header);

Select the data.

gpdb=# select * from sample; 
product_type | product_name | product_price 
--------------+--------------+---------------
 Large        | foo1         |           110
 Large        | foo2         |           120
 Large        | foo3         |           130
 Large        | foo4         |           140
 Large        | foo5         |           150
 Small        | bar1         |            10
 Small        | bar2         |            20
 Small        | bar3         |            30
 Small        | bar4         |            40
 Small        | bar5         |            50
(10 rows)

This solution works great for parsing reasonably sized XML files into a relational format but if you have very large files, use Hadoop and review my other blog post here.

New Project: gplink

I created a new project that simplifies the process to create Greenplum or Hawq External Tables using gpfdist to stream data from any valid JDBC source. It is like pointing gpfdist at Sqoop to pull data without landing a file but gplink ensures that the data is cleansed first so that the data will be readable by Greenplum or Hawq.

This will work with PostgreSQL (yeah!), MySQL (yuck), DB2, Informix, etc. You will have to download all third party JDBC drivers separately.

gplink v0.0.2
Source Code

This is a new project so I’m looking forward to any feedback you can provide.

Outsourcer 5.0.5

I have made improvements around the starting and stopping of gpfdist processes. It now handles the situation where a gpfdist process doesn’t start properly and instead of failing, it will pick the next available port. I also completely separated osstart and uistart. The User Interface does not need to be running in order for Outsourcer to load tables in the Queue or select from Custom SQL Tables.

This version is a release candidate as testing has shown it is very stable.

Download
Documentation
Source Code

Outsourcer 5.0.4

Here is another new release of Outsourcer 5. I will likely have another release by the end of the week to further clean up the code so be on the lookout for that. I will update the documentation for release 5.0.5 as well.

I am making changes to Outsourcer 5 due to the move from External Web Tables to External Tables using gpfdist. In my testing with 4 concurrent jobs, gpfdist seemed to work fine with only one process handling the load. After stress testing, I have learned that gpfdist is not multi-threaded and a single process cannot handle the load of multiple jobs.

I have resolved this issue in 5.0.3 by starting and stopping gpfdist for every job. Upon further stress testing, I identified a potential problem when gpfdist fails to start properly. This would cause multiple jobs to fail in a cascading fashion.

5.0.4 resolves the scenario when gpfdist fails to start. The single job will fail and not affect other jobs.

This release also decouples the User Interface from Outsourcer processing jobs in the Queue. You can run Outsourcer without the User Interface running at all.

I also identified a scenario where installing Outsourcer 5 on the Master was causing problems. The installer was expecting the installation to happen on the Standby-Master or on a dedicated ETL server, so the installer would create a new .pgpass file and back up any existing one. This would remove the gpmon entry and prevent Command Center from working correctly. 5.0.4 now keeps the gpmon entry, if found, in the .pgpass file. If you have a Standby-Master or an ETL host, install Outsourcer 5 there instead.

And lastly, if you just want a stable release of Outsourcer that isn’t changing frequently, use Outsourcer 4.1.6 which hasn’t changed since January and uses External Web Tables.

5.0.4
Source Code

External Table “TRANSFORM” Option

In the Greenplum Administrator Guide, there is a section that covers loading XML data with gpload. It also mentions that you can create the External Table yourself. This demo will show you how to create an External Table that utilizes this feature so that you can execute any script you want on your ETL server. This demo also works for HAWQ.

XML Parsing is done sequentially from the top to the bottom of a file and the TRANSFORM option was built with this in mind. The gpfdist process will execute a script for you and the output is read by the segments in parallel. Instead of parsing XML with a script, this demo will execute a Unix command in a script to show you how you can leverage this feature to execute virtually any command you want on a remote server.

ETL Server
/data1/demo5/demo5.yml

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  transform_demo5:
     TYPE: input 
     CONTENT: data
     COMMAND: /bin/bash get_df.sh

As you can see, I created a transform named “transform_demo5” that executes the script “get_df.sh”. So let’s look at get_df.sh.
/data1/demo5/get_df.sh

df -k | awk '{print $1"|"$2"|"$3"|"$4"|"$5"|"$6}' | tail -n +2

This simple command executes the Unix df command and converts that to a pipe delimited file. Executing the command outputs this:

[pivhdsne:demo5]$ df -k | awk '{print $1"|"$2"|"$3"|"$4"|"$5"|"$6}' | tail -n +2
/dev/sda3|47472560|12080036|32981056|27%|/
tmpfs|4030752|0|4030752|0%|/dev/shm
/dev/sda1|99150|48764|45266|52%|/boot

Now start a gpfdist process that uses a configuration file. The “-c” option isn’t well documented with gpfdist but it is mentioned with the XML parsing.

gpfdist -p 8999 -c /data1/demo5/demo5.yml > demo5.log 2>&1 &

Now create an External Table. Notice the format of using #transform=transform_demo5. The filename of “foo” is ignored but you can reference the filename as a parameter to your transform scripts.

/data1/demo5/get_df.sql

CREATE EXTERNAL TABLE get_df 
(Filesystem text,
 K_blocks int,
 Used int,
 Available int,
 Used_percentage text,
 Mounted_on text)
LOCATION ('gpfdist://pivhdsne:8999/foo#transform=transform_demo5')
FORMAT 'TEXT' (DELIMITER '|');

Now create the table in your database server (Hawq or Greenplum).

psql -f get_df.sql -h gpdbvm43 
Password: 
Timing is on.
CREATE EXTERNAL TABLE
Time: 238.788 ms

Now let’s see the output.

[pivhdsne:demo5]$ psql -h gpdbvm43
Password: 
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from get_df;
 filesystem | k_blocks |   used   | available | used_percentage | mounted_on 
------------+----------+----------+-----------+-----------------+------------
 /dev/sda3  | 47472560 | 12104924 |  32956168 | 27%             | /
 tmpfs      |  4030752 |        0 |   4030752 | 0%              | /dev/shm
 /dev/sda1  |    99150 |    48764 |     45266 | 52%             | /boot
(3 rows)

Time: 32.149 ms

This is yet another feature of Greenplum and Hawq that gives you more flexibility in working with data. I’ve seen this feature used to move files, change permissions, get data from running programs, and of course parsing XML. Enjoy!

Open Data Platform

Pivotal teams up with Hortonworks for the Open Data Platform. Pivotal is also making the data products (Gemfire, Greenplum Database, and HAWQ) all open source.

http://blog.pivotal.io/big-data-pivotal/news-2/pivotal-big-data-suite-open-agile-cloud-ready

http://hortonworks.com/blog/pivotal-hortonworks-announce-alliance/

http://blog.pivotal.io/big-data-pivotal/news-2/why-the-open-data-platform-is-such-a-big-deal-for-big-data

Outsourcer 4.1.5 released

Bugs Fixed
1. Error messages raised from Oracle and SQL Server could include special characters which would cause the thread in Outsourcer to die and the status remain labeled as “processing”. Error messages no longer have escape and quote characters to prevent this problem.
2. On new installations, the permissions for the Oracle and SQL Server Jar files would remain owned by root. This has now been corrected to be the admin user (gpadmin).
3. On the “Sources” screen, the label for Append-Only now is correct for Greenplum Database 4.3 and reflects that these are really “Append-Optimized” tables.

Enhancements
1. Added External Tables to enable starting and stopping the UI rather than having to ssh to the Master host.
2. “Queue” screen now provides the ability to cancel processing jobs.
3. Stopping Outsourcer Queue Daemon (Environment Screen), now will cancel currently running Jobs in the Queue.
4. Starting Outsourcer Queue Daemon (Environment Screen), now will make sure orphaned jobs in the Queue will be marked as Failed and if any are currently running, it will cancel these jobs too.

Download
Documentation
Github

Greenplum Database Connecting to Hortonworks UPDATED

Hortonworks 2.1 is now officially supported by Pivotal in the Greenplum Database version 4.3.3.x and 4.2.8.x by using External Tables.

Greenplum Database 4.3.3 Documentation
Greenplum Database 4.2.8 Documentation

The documentation has the basic configuration but I wanted to provide more detailed instructions.

Hortonworks
For my test, I downloaded the Hortonworks HDP 2.1 Sandbox. Next, I started the instance and I created a user named “gpadmin” just to make it easier for me to connect to the cluster.

Pivotal Greenplum Database
Step 1: Downloaded the Greenplum Database Trial Virtual Machine and started the cluster.

Step 2: As root, add the Hortonworks repo

wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.7.0/hdp.repo -O /etc/yum.repos.d/hdp.repo

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 3: As root, install the Hadoop client

yum install hadoop* openssl

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 4: As root, remove previous OpenJDK and install Oracle JDK. You will need to download the Oracle JDK from Oracle.

yum remove java
rpm -i jdk-7u67-linux-x64.rpm

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 5: As gpadmin, start the Greenplum database and set the Hadoop target version to Hortonworks

gpconfig -c gp_hadoop_target_version -v hdp2

Step 6: As gpadmin, edit your .bashrc and add the following

export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/hadoop/client

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 7: As gpadmin, restart the database

source .bashrc
gpstop -r -a

Step 8: As root, add host entry to the Master and Standby-Master for the Hadoop cluster

echo "192.168.239.219 hdp21" >> /etc/hosts

Note: This will need to be done on all nodes in a full Greenplum cluster and all Hortonworks nodes should be in the hosts file too. DNS can also be used.

Build Tables

DROP EXTERNAL TABLE IF EXISTS ext_foo;
CREATE WRITABLE EXTERNAL TABLE ext_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 60 ms.
INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;
--Query returned successfully: 100 rows affected, 2636 ms execution time.
DROP EXTERNAL TABLE IF EXISTS ext_get_foo;
CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 28 ms.
SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
/*
1;"bar_1"
2;"bar_2"
3;"bar_3"
4;"bar_4"
5;"bar_5"
6;"bar_6"
7;"bar_7"
8;"bar_8"
9;"bar_9"
10;"bar_10"
*/

Maintenance Script for HAWQ and Greenplum database

A common question I get is what maintenance activity is needed in HAWQ and Greenplum database. There are lots of ways to handle this and a lot of customers do it as part of their ETL processing. I put together a script that is pretty generic and should be very useful for most implementations.

The script is available on Github in a new project I created for scripts. As I come up with more scripts for various activities, I will put them in this Github repository.

Link to repository

maintain.sh
Note: Run this script on a regular basis such as weekly or monthly.

Enjoy!

When to ANALYZE in Greenplum and HAWQ?

Table statistics give the cost-based optimizer the information needed to build the best query plan possible, and this information is gathered with the ANALYZE command. But when should you execute an ANALYZE on a table? Don’t Greenplum and HAWQ do this automatically?

Greenplum and HAWQ will perform an ANALYZE automatically for you so the query optimizer will have good statistics and build a good plan. Greenplum and HAWQ also allow you to configure this if needed with gp_autostats_mode and gp_autostats_on_change_threshold.

gp_autostats_mode
This specifies when an automatic ANALYZE should take place. The values for this configuration parameter can be:
none
on_change
on_no_stats

on_no_stats
The default is on_no_stats and the database will automatically perform an ANALYZE for you. If you INSERT data into an empty table (new table or a table you just did a TRUNCATE on), the database will automatically gather statistics with an ANALYZE.

Typically in an Analytics Data Warehouse, you will insert data once and then read it many times. So the default of on_no_stats will automatically give you statistics as needed with no additional work from you.

none
Self explanatory. You have to execute ANALYZE if you want statistics.

on_change
With this setting, if you perform an INSERT, UPDATE, or DELETE that exceeds the gp_autostats_on_change_threshold value, then an automatic ANALYZE will happen. Note that for HAWQ, you can only INSERT.

In Greenplum and in the rare case where you are doing a DELETE or UPDATE to a table, you will need to execute an ANALYZE statement. Or, you can set the database to use on_change and the ANALYZE will happen automatically.

gp_autostats_on_change_threshold
This is only relevant for on_change and it is the number of rows that need to change before an automatic ANALYZE will happen. The default is over 2 billion rows so if you really want to use on_change, then you will also need to reduce this configuration parameter to something more realistic.
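
To see what your cluster is currently using, gpconfig can show the values; a quick check:

gpconfig -s gp_autostats_mode
gpconfig -s gp_autostats_on_change_threshold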

Examples

Example 1 – The default
gp_autostats_mode = on_no_stats
gp_autostats_on_change_threshold = 2147483647

CREATE TABLE my_table AS 
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
--Statistics will be gathered automatically.
CREATE TABLE my_table 
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics won't be updated.  The planner will think the table has only 1 row.

ANALYZE my_table;
--Statistics manually gathered and correctly shows the correct number of rows in the table.

Example 2 – Using on_change
gp_autostats_mode = on_change
gp_autostats_on_change_threshold = 1000000

CREATE TABLE my_table AS 
SELECT state, count(*) AS counter
FROM customer
GROUP BY state
DISTRIBUTED BY (state);
--Statistics will be gathered automatically only if the number of rows is 1M or more.
CREATE TABLE my_table 
(state varchar(100),
 counter int)
DISTRIBUTED BY (state);

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.
TRUNCATE my_table;

INSERT INTO my_table
VALUES ('dummy', 0);
--Statistics will not be gathered automatically.

INSERT INTO my_table
SELECT state, count(*) AS counter
FROM customer
GROUP BY state;
--Statistics will be gathered automatically only if the number of rows is 1M or more.

Checking the Statistics
And here are a couple of queries you can use to see the statistics information.

SELECT c.oid, c.relpages, c.reltuples 
FROM pg_class c 
JOIN pg_namespace n ON c.relnamespace = n.oid 
WHERE n.nspname = 'public' 
AND c.relname = 'my_table';
--using the oid from the previous query
SELECT * FROM pg_statistic WHERE starelid = 298610;
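
If you prefer a friendlier view than the raw pg_statistic catalog, the standard pg_stats view presents the same information per column; a quick sketch:

SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE schemaname = 'public'
AND tablename = 'my_table';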

Summary
Greenplum and HAWQ automatically gather statistics for you in most cases. If you are doing lots of DML activity, you can change the configuration to still automatically gather statistics for you too. These automatic settings make life easier for DBAs, Developers, and Analysts but still give you the flexibility to configure it in the best way for your environment.

Greenplum Database Incremental Backup

Greenplum Database has a feature called “incremental” backup but really it is an incremental differential backup. In this post, I will explain how it works and give you a working example.

AO Tables
In version 4.2 and earlier, AO tables are Append Only but starting in 4.3, the tables are Append Optimized. Everyone using 4.3 should be using AO tables as they use less space, support this incremental feature, and execute faster. Incremental backup requires AO tables.

Example:

CREATE TABLE foo (id integer, bar text) WITH (appendonly=true) DISTRIBUTED BY (id);

Using the incremental option on a regular heap table will not make any difference; every backup will back up all heap tables.

Demo

First create a database.

gpdb=# drop database poc;
DROP DATABASE
gpdb=# CREATE DATABASE poc;
CREATE DATABASE
gpdb=# \c poc
You are now connected to database "poc" as user "gpadmin".

Now create an AO table and insert some data.

poc=# CREATE TABLE foo (id integer, bar text) WITH (appendonly=true) DISTRIBUTED BY (id);
CREATE TABLE
poc=# INSERT INTO foo VALUES (1, 'bar_1');
INSERT 0 1
poc=# INSERT INTO foo VALUES (2, 'bar_2');
INSERT 0 1
poc=# 

I’m next going to execute a full backup. For the demo, I’m going to turn off compression so it is easier to view the backup files.

I’m using gpcrondump with these switches:
-x specifies the database to backup
-a means automatic and don’t prompt
-z turns off compression

bigmac:~ gpadmin$ gpcrondump -x poc -a -z
20140909:13:39:19:006156 gpcrondump:bigmac:gpadmin-[INFO]:-Starting gpcrondump with args: -x poc -a -z
...

Just to see what is in the backup file, I’m going to cat one of the segment’s backup files. It shows one of the two records in this tiny table.

bigmac:~ gpadmin$ cat /data/primary/gpseg0/db_dumps/20140909/gp_dump_0_2_20140909133919 
--
-- Greenplum Database database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

SET search_path = public, pg_catalog;

SET default_with_oids = false;

--
-- Data for Name: foo; Type: TABLE DATA; Schema: public; Owner: gpadmin
--

COPY foo (id, bar) FROM stdin;
1	bar_1
\.


--
-- Greenplum Database database dump complete
--

Now I immediately run an incremental backup. Nothing changed in my database so it should be an empty backup set.

bigmac:~ gpadmin$ gpcrondump -x poc -a -z --incremental
20140909:13:41:19:006395 gpcrondump:bigmac:gpadmin-[INFO]:-Starting gpcrondump with args: -x poc -a -z --incremental
...

To verify:

bigmac:~ gpadmin$ cat /data/primary/gpseg0/db_dumps/20140909/gp_dump_0_2_20140909134119 
--
-- Greenplum Database database dump
--

SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

--
-- Greenplum Database database dump complete
--

So my incremental backup doesn’t have any data in it. If I had deleted or inserted any data to my one table, the entire table would be part of the backup set. That is why it is an incremental differential backup. If there is a difference in the AO table (or partition), the entire table/partition will be included in the incremental backup.

Restore
So now that I have one full backup and one incremental, let’s drop the table and then restore it.

bigmac:~ gpadmin$ psql poc
psql (8.2.15)
Type "help" for help.

poc=# DROP TABLE foo;
DROP TABLE
poc=# \q

And now the restore using gpdbrestore. I’m using the following switches:
-a means automatic and don’t prompt
-e will drop the target database and then recreate it
-s uses the latest backup set

bigmac:~ gpadmin$ gpdbrestore -a -e -s poc
20140909:13:42:58:006720 gpdbrestore:bigmac:gpadmin-[INFO]:-Starting gpdbrestore with args: -a -e -s poc
...

Verify the restore worked:

bigmac:~ gpadmin$ psql poc
psql (8.2.15)
Type "help" for help.

poc=# SELECT * FROM foo;
 id |  bar  
----+-------
  1 | bar_1
  2 | bar_2
(2 rows)

poc=# 

Conclusion
Both commands used here, gpcrondump and gpdbrestore, have lots of options to backup and restore respectively. Using the incremental switch on gpcrondump will save you disk space and execution time while still giving you the ability to do a full system restore or even a selective table restore.
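
gpdbrestore can also restore a single table from the backup set with the -T option; a hedged sketch, so check the options available in your version:

gpdbrestore -a -s poc -T public.foo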

Greenplum 4.3 External Tables to Hortonworks 2.0

The following instructions are valid for connecting Greenplum Database to Hortonworks HDP 2.0. Note that this technique is not officially supported by Pivotal and Pivotal has added support for HDP 2.1 in Greenplum Database version 4.3.3 and 4.2.8. See this post for more information.

Hortonworks (HDP) 2.0 is based on Apache Hadoop 2.2 which is also what Pivotal HD 2.0 is based on. If you haven’t read it already, I was able to demonstrate how Greenplum can read and write data to a Pivotal HD cluster in parallel: http://www.pivotalguru.com/?p=700. You can use this same basic configuration to integrate Greenplum with Hortonworks 2.0!

Follow the same steps in configuring Greenplum database for Pivotal HD 2.0. We’ll use the same client libraries to connect to HDP 2.0. Next create a user in HDP 2.0 named gpadmin. Lastly, create the same types of External Tables but use the IP address of the HDP cluster.

CREATE WRITABLE EXTERNAL TABLE ext_foo_hdp20
(i int, bar text) LOCATION ('gphdfs://192.168.239.215/user/gpadmin/foo_bar') FORMAT 'text' (delimiter '|' null 'null');

INSERT INTO ext_foo_hdp20 SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;

CREATE EXTERNAL TABLE ext_get_foo_hdp20
(i int, bar text) LOCATION ('gphdfs://192.168.239.215/user/gpadmin/foo_bar') FORMAT 'text' (delimiter '|' null 'null');

Here are the files Greenplum created in the Hortonworks cluster.

[gpadmin@sandbox ~]$ hadoop version
Hadoop 2.2.0.2.0.6.0-76
Subversion git@github.com:hortonworks/hadoop.git -r 8656b1cfad13b03b29e98cad042626205e7a1c86
Compiled by jenkins on 2013-10-18T00:19Z
Compiled with protoc 2.5.0
From source with checksum d23ee1d271c6ac5bd27de664146be2
This command was run using /usr/lib/hadoop/hadoop-common-2.2.0.2.0.6.0-76.jar
[gpadmin@sandbox ~]$ hdfs dfs -ls /user/gpadmin/foo_bar/
Found 2 items
-rw-r--r--   3 gpadmin gpadmin        490 2014-08-05 06:19 /user/gpadmin/foo_bar/0_1407266192-0000000091
-rw-r--r--   3 gpadmin gpadmin        494 2014-08-05 06:19 /user/gpadmin/foo_bar/1_1407266192-0000000091

Here is the data from HDP in Greenplum.

postgres=# SELECT * FROM ext_get_foo_hdp20 ORDER BY i LIMIT 10;
 i  |  bar   
----+--------
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)

postgres=# 

I suspect the integration will get better and better between Greenplum and the various Hadoop distributions because for the most part, all of the distributions rely on the Apache distribution of Hadoop. Look for more to come in the future too!

HAWQ versus Greenplum Database

HAWQ is a port of Greenplum database to work with the HDFS file system from Hadoop. It is only available with the Pivotal HD distribution of Hadoop even though Pivotal HD is 100% Apache Hadoop compliant.

This post will focus on the major differences in the filesystem, External Tables, DDL commands, and DML commands. I’m comparing Greenplum database version 4.3 with HAWQ version 1.2.0.1 which is shipped with Pivotal HD version 2.0.1.

HDFS versus Posix Filesystem
With Greenplum database, each segment’s files are always local to the host. Even in a failover condition when the mirror is acting as the primary, the data is local to the processing.

With HAWQ, HDFS handles high availability by having three copies of the data across multiple nodes. Because of this, the mirroring that is built into the Greenplum database is removed from HAWQ. If a node were to fail, then Hadoop automatically creates a third copy of the data. So a segment running on a Hadoop data node may not have the data it needs local to it and will need to get data from other physical nodes.

External Tables
Both HAWQ and Greenplum database have External Tables but differ when accessing external data in Hadoop.

HAWQ has PXF which is the Pivotal Extension Framework. It has the ability to access files in HDFS stored as plain text but also in Hive, Hbase, Avro, and Gemfire XD. You can write your own custom profiles to get data from HDFS. PXF can also get some statistics about these files so the optimizer is smarter when accessing these External Tables.

Greenplum database doesn’t have PXF but does have GPHDFS. GPHDFS enables Greenplum database to read and write data to HDFS. It doesn’t have built-in capabilities to Avro, Hive, HBase, and Gemfire XD. It also doesn’t have statistics for these External Tables.

HAWQ is great at exploring and transforming data in Hadoop while Greenplum database is great at bulk loading data from Hadoop into the database as well as bulk writing data from Greenplum database into Hadoop. So land all of your data in Hadoop, transform it with SQL and then create data marts in Greenplum database.

Functions
Both offer functions but HAWQ doesn’t have SECURITY DEFINER functions yet.

DDL Commands
Here is a list of commands that are in Greenplum database but not in HAWQ:

  • CREATE AGGREGATE: user defined aggregate like SUM and COUNT.
  • CREATE CAST: user defined conversion of two datatypes.
  • CREATE CONVERSION: user defined conversion of character set encodings.
  • CREATE DOMAIN: user defined datatype with optional constraints.
  • CREATE INDEX: indexes aren’t supported in HAWQ.
  • CREATE OPERATOR: user defined operator, for example defining != to behave the same as <>.
  • CREATE OPERATOR CLASS: user defined class of how a data type is used within an Index.
  • CREATE RULE: user defined filter placed on a table or view like “gender_code in (‘M’, ‘F’)”
  • CREATE TABLESPACE: user defined location in the Posix filesystem used to store database objects.
  • CREATE TRIGGER: user defined trigger for a table. Note that this is very limited in Greenplum database.

DML Commands
HDFS is designed for “write once, read many” and cannot modify files in place, which is required for DELETE and UPDATE commands. Because of this, HAWQ doesn’t support UPDATE and DELETE commands while Greenplum database does.

Summary
There are other small differences between the two products but these are the major ones.

Pivotal HD and Greenplum Database Integration

So you are thinking about building a Hadoop cluster, or you already have one, and you are wondering how to move data between Hadoop and an MPP database like Greenplum Database (GPDB). With GPDB, this can be done in parallel which makes it the ideal solution for an Analytics, Reporting, or Data Warehousing environment.

In this example, I will generate some dummy data with a SQL statement in GPDB and load it into PHD with an INSERT statement. I will then read the data back out of PHD with a SELECT statement.

One Time Configuration Steps

1. Download Pivotal HD that matches your version.
https://network.gopivotal.com/products/pivotal-hd

2. Copy the file to the Master server in the GPDB cluster. In my example, gpdbvm43 is a single node VM with 2 segments.

scp PHD-2.0.1.0-148.tar.gz root@gpdbvm43:/root

3. ssh as gpadmin to GPDB 4.3 instance

vi .bashrc
[add these entries]
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/gphd

Note: this needs to be repeated on every host in the cluster. You can use gpscp or gpssh to make these changes to all hosts too.

4. Make sure the database is running.

gpstart -a

5. Change the gp_hadoop_target_version to be compatible with Pivotal HD 1.0 and greater.

gpconfig -c gp_hadoop_target_version -v gphd-2.0

6. Add an entry to your /etc/hosts for all of the hosts in the Hadoop cluster. This needs to be done on all hosts in the GPDB cluster too. I’m using a single node VM of PHD so I just have one entry.

vi /etc/hosts 
[add every node in the PHD cluster]
192.168.239.203 pivhdsne.localdomain pivhdsne

Note: You can then use gpscp to copy the revised hosts file to the other hosts in the cluster.

7. Install the PHD client in the GPDB cluster.

su - 
tar --no-same-owner -zxvf PHD-2.0.1.0-148.tar.gz
cd PHD-2.0.1.0-148/utility/rpm
rpm -i *.rpm
cd ../../zookeeper/rpm
rpm -i *.rpm
cd ../../hadoop/rpm
yum install nc
rpm -i *.rpm
exit

Note: You can use gpscp to copy the tar.gz file to the other hosts in the cluster and then use gpssh to execute these commands. Be sure to source the greenplum_path.sh after connecting as root. “nc” may not be needed on your cluster but was required with my VM.

8. Now that you are gpadmin again, bounce the database.

gpstop -r

Pivotal HD Configuration
You likely already have this done but if you are using the single node VM of Pivotal HD, then you will need to edit your /etc/hosts file there so that Hadoop is accessible remotely.

1. ssh as root to the VM

[root@pivhdsne ~]# ifconfig
eth1      Link encap:Ethernet  HWaddr 00:0C:29:20:A3:8F  
          inet addr:192.168.239.203  Bcast:192.168.239.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe20:a38f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9168 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1199 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:653659 (638.3 KiB)  TX bytes:199320 (194.6 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3779367 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3779367 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:4407394192 (4.1 GiB)  TX bytes:4407394192 (4.1 GiB)

Now vi /etc/hosts and change 127.0.0.1 to the IP address of the VM.

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.239.203 pivhdsne pivhdsne.localdomain

GPDB External Tables
1. Example of Writable External Table to PHD.

CREATE WRITABLE EXTERNAL TABLE ext_foo
(i int, bar text) LOCATION ('gphdfs://pivhdsne/foo_bar') FORMAT 'text' (delimiter '|' null 'null');

2. Insert some data into PHD from GPDB.

INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;

3. Create External Table to view the data in PHD.

CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION ('gphdfs://pivhdsne/foo_bar') FORMAT 'text' (delimiter '|' null 'null');

4. Select the data.

postgres=# SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
 i  |  bar   
----+--------
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)

Parallel!
Here is how you can see that it was done in parallel. Notice there are two files in the /foo_bar directory that I specified in the External Writable Table above.

[pivhdsne:~]$ hdfs dfs -ls /foo_bar
Found 2 items
-rw-r--r--   3 gpadmin hadoop        490 2014-05-24 00:19 /foo_bar/0_1400790662-0000004541
-rw-r--r--   3 gpadmin hadoop        494 2014-05-24 00:19 /foo_bar/1_1400790662-0000004541

There are two files because my single node VM of GPDB has two Segments. Each Segment wrote its file to Hadoop at the same time. Completely parallel and scalable.

More proof!

[pivhdsne:~]$ hdfs dfs -cat /foo_bar/0_1400790662-0000004541 | more
1|bar_1
3|bar_3
5|bar_5
7|bar_7
9|bar_9
...

[pivhdsne:~]$ hdfs dfs -cat /foo_bar/1_1400790662-0000004541 | more
2|bar_2
4|bar_4
6|bar_6
8|bar_8
10|bar_10
...

Extra Credit!
Log into HAWQ on the PHD cluster and create an External Table to the same files!

CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION 
('pxf://pivhdsne:50070/foo_bar?profile=HdfsTextSimple') FORMAT 'text' (delimiter '|' null 'null');

gpadmin=# SELECT * FROM ext_get_foo ORDER BY i limit 10;
 i  |  bar   
----+--------
  1 | bar_1
  2 | bar_2
  3 | bar_3
  4 | bar_4
  5 | bar_5
  6 | bar_6
  7 | bar_7
  8 | bar_8
  9 | bar_9
 10 | bar_10
(10 rows)

As you can see, there are lots of ways to move data between Greenplum database and Hadoop to satisfy a large variety of use cases to solve business problems.

Outsourcer 4.1.0 Released!

Outsourcer now supports HAWQ! If you aren’t familiar with HAWQ, it is basically Greenplum Database that uses HDFS to store data. This makes Outsourcer even more useful as you can load data into Hadoop from SQL Server and Oracle by clicking through an easy to use web interface.

There are many more new features like the ability to specify Append-Optimized tables, compression, or column orientation. Check it out!

Zip File
Documentation
GitHub Source

Install Video:

When should I use Greenplum Database versus HAWQ?

I get this question a lot so I thought I would post the answer. Greenplum is a robust MPP database that works very well for Data Marts and Enterprise Data Warehouses that tackles historical Business Intelligence reporting as well as predictive analytical use cases. HAWQ provides the most robust SQL interface for Hadoop and can tackle data exploration and transformation in HDFS.

HAWQ
HAWQ is “HAdoop With Query” and is basically a port of Greenplum to store data natively in HDFS. This empowers everyone using Hadoop to use a friendlier and more robust programming interface to Hadoop. Tools for Hadoop have evolved over time because of the need to get more consumers using Hadoop in a more productive manner.

The HAWQ architecture is very similar right now with each segment having its own files but instead of storing these files local to each segment (shared nothing), the data is stored in HDFS. Over time, I believe this architecture will change to enable greater elasticity in the number of segments processing data but right now, Greenplum and HAWQ share this design.

Hadoop Architecture
There are lots of material online about how Hadoop works but I want to point out one of the principals of Hadoop and how it relates to having a relational database store its files in it. Simple Coherency Model is a “write-once-read-many access model for files” that HDFS uses to simplify the data model in Hadoop. Once you “put” a file into HDFS, you can’t edit it to change the file. You can remove the file but once you write it, that is it.

Database Features
There are many “small” features in Greenplum Database that either aren’t supported yet in HAWQ or never will be because HDFS won’t allow them. These features need to be taken into consideration when choosing which solution to deploy on. For example, here is a short list of features not found in HAWQ but present in Greenplum database: LOCK TABLE, CREATE INDEX, CREATE TYPE, CREATE RULE, CREATE AGGREGATE, and CREATE LANGUAGE.

Two more commonly used features of Greenplum that are not in HAWQ are DELETE and UPDATE commands. When a database executes a DELETE or UPDATE, it is editing the file. It basically needs to mark the old row as removed which requires changing the file. This isn’t allowed in HDFS so consequently, UPDATE and DELETE commands are not allowed in HAWQ.

HAWQ Features
Now there are some features of HAWQ that don’t exist in Greenplum database that are also compelling. PXF is the Pivotal Extension Framework that provides enhanced capabilities when working with data in HDFS from HAWQ. Simply put, you can create an External Table to HDFS that reads Text, Hive, HBase, and soon Parquet. These are optimized for these formats and can leverage partitioning and predicate push-down. Another feature is the ability to gather statistics on these external tables to HDFS. So if you have a large amount invested in Hadoop already, HAWQ can leverage that investment and extend it with more capabilities.

How does it all fit together?
A lot of companies are embracing Hadoop as it is an inexpensive, commodity hardware based, scalable solution that can handle terabytes to petabytes worth of data. HAWQ makes Hadoop better by providing a robust SQL interface built on over 10 years development of the Greenplum database.

Companies are building their “data lakes” and using Pig, Hive, HBase, and now SQL to explore and transform their data. This is where HAWQ excels.

Once data is transformed, companies are loading the relational data for consumption into Data Marts or Enterprise Data Warehouses. This is where Greenplum database comes in by leveraging a robust MPP database to provide historical Business Intelligence capabilities and forward looking predictive analytics all in-database with SQL.

Architecture

Seeing “Invisible” Data

Note: Pivotal Support requested that I mention that this GUC, along with all other hidden GUCs, are not supported. Being able to see deleted data has some benefit in extreme situations but can get you into trouble if used improperly. Do not open a support ticket on this GUC or any other hidden GUC as Pivotal Support will not be able to help. It is hidden for a reason!

I stumbled on this neat little feature today from a hidden GUC (Grand Unified Configuration). If you aren’t familiar with the term GUC, it is basically a configuration value that can be set for a session with “SET” or in the postgresql.conf file using gpconfig.

First, let’s create a table and insert some data.

gpdb=# CREATE TABLE foo (id int NOT NULL, bar text NOT NULL) DISTRIBUTED BY (id);
CREATE TABLE
gpdb=# INSERT INTO foo SELECT i, 'test:' || i FROM generate_series(1, 10) AS i;
INSERT 0 10

Now select the data.

gpdb=# SELECT * FROM foo ORDER BY id;
 id |   bar   
----+---------
  1 | test:1
  2 | test:2
  3 | test:3
  4 | test:4
  5 | test:5
  6 | test:6
  7 | test:7
  8 | test:8
  9 | test:9
 10 | test:10
(10 rows)

Now do something like delete all of the data.

gpdb=# DELETE FROM foo;
DELETE 10

Oh no! I need that data still! How can I get it back?

gpdb=# SET gp_select_invisible = TRUE;
SET
gpdb=# SELECT * FROM foo ORDER BY id;
 id |   bar   
----+---------
  1 | test:1
  2 | test:2
  3 | test:3
  4 | test:4
  5 | test:5
  6 | test:6
  7 | test:7
  8 | test:8
  9 | test:9
 10 | test:10
(10 rows)

As you can see, the old data is still there but marked as deleted. You can still see it by turning this hidden GUC on.

Now here it is turned off to see the data normally.

gpdb=# SET gp_select_invisible = FALSE;
SET
gpdb=# SELECT * FROM foo ORDER BY id;
 id | bar 
----+-----
(0 rows)

You can use this to see non-committed data being inserted into a heap table but not an append-only table. Once you vacuum or truncate a table, the invisible data gets removed so you won’t be able to see it anymore. BTW, this works in HAWQ too!
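
Based on the note above, a quick way to confirm that behavior is to vacuum the table and then query it again with the GUC enabled:

VACUUM foo;
SET gp_select_invisible = TRUE;
SELECT * FROM foo ORDER BY id;
--returns 0 rows because the vacuum removed the deleted tuples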

Outsourcer 4.0.2 Released

Outsourcer 4.0.2 is now available for download.

New to 4.0.2

  • Corrected Open Source license file. It is now using a BSD license and the NanoHTTPD license (web server).
  • Corrected install.sql file that incorrectly had $BODY$ for one function in the declaration.
  • Corrected cookie conflict with Command Center.
  • Reduced the number of database calls when using the Web interface.
  • Removed updates from the os.sessions table for the Web interface.

Links

  • 4.0.2 Zip
  • 4.0.2 Administrator’s Guide
  • 4.0.2 Source Code

Greenplum Database 4.3

    Pivotal recently released this new version without any major announcements but customers should definitely check it out! There are three main enhancements that I will cover in this blog post.

    1. Append Optimized Tables
    2. WAL Standby Replication
    3. Orca Optimizer

    Append-Optimized Tables
    Greenplum database has had “Append-Only” tables for several versions. It only allowed inserting new data into a table, which saves about 20 bytes per row because it doesn’t need the overhead of per-row visibility. Append-Only tables allow you to use compression or change the orientation to column, so they are typically used for larger tables in the database.

    With Greenplum Database 4.3, Append-Only tables are now Append-Optimized. You still get the benefits of an Append-Only table but now you will be able to UPDATE and DELETE from these tables as well.

    Here is an example with a compressed and column oriented table that is Append-Optimized:

    gpdb=# CREATE TABLE bar (a int, b text)
    gpdb-# WITH (appendonly=true, orientation=column, compresstype=quicklz)
    gpdb-# DISTRIBUTED BY (a);
    CREATE TABLE
    --Notice how it is still using "appendonly" in the syntax
    

    INSERT some data and SELECT from this new table.

    gpdb=# INSERT INTO bar VALUES (1, 'jon');
    INSERT 0 1
    gpdb=# INSERT INTO bar VALUES (2, 'roberts');
    INSERT 0 1
    gpdb=# SELECT * FROM bar;
     a |    b    
    ---+---------
     2 | roberts
     1 | jon
    (2 rows)
    

    UPDATE example:

    gpdb=# UPDATE bar SET b = UPPER(b);
    UPDATE 2
    gpdb=# SELECT * FROM bar;
     a |    b    
    ---+---------
     1 | JON
     2 | ROBERTS
    (2 rows)
    

    DELETE example:

    gpdb=# DELETE FROM bar WHERE a = 1;
    DELETE 1
    gpdb=# SELECT * FROM bar;
     a |    b    
    ---+---------
     2 | ROBERTS
    (1 row)
    

    This looks just like any other table but the difference is in the row versioning. Instead of using MVCC (which was borrowed from PostgreSQL), Append-Optimized tables have a Visibility Map which is a bitmap of what rows are no longer visible due to a DELETE or UPDATE statement on the table.

    Because it isn’t using MVCC anymore, serializable transactions with UPDATE and DELETE to an Append-Optimized table are not allowed. A serializable transaction is a strict transaction isolation in which transactions execute as if they run one after another rather than concurrently. Serializable is almost never used in a big data platform so it isn’t a big deal.

    Another change is how VACUUM works. Instead of marking stale rows as deleted so that row can be overwritten (MVCC), VACUUM compacts the physical files and resets the visibility map.

    When should you use a Heap table (default) versus an Append-Optimized table? If you have any OLTP work in Greenplum, Heap is better. If you have frequent UPDATE or DELETE statements, Heap is likely better because VACUUM should run faster. Query performance may also be a bit better for smaller tables that are Heap. So for larger tables that have no or infrequent UPDATE or DELETE statements, use Append-Optimized. If you want to compress the table or make it column oriented, you must make it Append-Optimized.
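
    If you want to check how an existing table is stored, the relstorage column in pg_class records it; a quick sketch, where the letter codes are assumptions based on 4.3 (h for heap, a for append-optimized row, c for append-optimized column):

    SELECT relname, relstorage
    FROM pg_class
    WHERE relname = 'bar';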

    WAL Standby Replication
    Greenplum database has a Master server where users connect and execute queries. The Master server manages sessions, security, metadata, and query plans. Queries are executed on the Segment servers in the Greenplum cluster. To provide high availability, the Master server has a Standby-Master server. Prior to 4.3, Greenplum used block level replication to keep the data on the Master in sync with Standby-Master.

    With 4.3, the WAL (Write-Ahead Log) is now streamed to the Standby-Master that is in a standby mode. This provides faster promotion of the Standby-Master in the event where the Master fails. It also has an improvement in performance for the Master server.

    Orca
    Engineering has refactored the query optimizer to make it easier to maintain and enhance. They also took this opportunity to enhance this optimizer to handle certain edge cases where the optimizer picked a poor plan. It does a better job at handling certain scenarios with partitioned tables, subqueries, and common table expressions.

    Orca should not be used in Production in release 4.3. To use it, you need to be part of the Early Access Program (EAP). It is a very simple command to enable it but we want customers to provide feedback before you make Orca the default optimizer. Contact your Pivotal Field Engineer to get into this program and if you don’t know who this is, contact me and I’ll help you.

    Summary
    Greenplum database 4.3 is a great release with many new features. Customers can download it now from emc.subscribenet.com and documentation from support.emc.com.

    Advanced Memory Management

    Overview
    In the Installation Guide, it states (for RHEL 6) that the sysctl.conf file needs the following values:

    xfs_mount_options = rw,noatime,inode64,allocsize=16m 
    kernel.shmmax = 500000000
    kernel.shmmni = 4096
    kernel.shmall = 4000000000
    kernel.sem = 250 512000 100 2048
    kernel.sysrq = 1
    kernel.core_uses_pid = 1
    kernel.msgmnb = 65536
    kernel.msgmax = 65536
    kernel.msgmni = 2048
    net.ipv4.tcp_syncookies = 1 
    net.ipv4.ip_forward = 0 
    net.ipv4.conf.default.accept_source_route = 0 
    net.ipv4.tcp_tw_recycle = 1 
    net.ipv4.tcp_max_syn_backlog = 4096 
    net.ipv4.conf.all.arp_filter = 1 
    net.ipv4.ip_local_port_range = 1025 65535 
    net.core.netdev_max_backlog = 10000 
    vm.overcommit_memory = 2
    

    The kernel.shmmax, kernel.shmall, and vm.overcommit_memory settings basically let the operating system manage the memory caching. Quoting Luke Lonergan (one of the founders of Greenplum), “The automatic IO cache in Linux will cache the first IO to tables that fit and subsequent queries will use the cache provided the tables aren’t ejected.” This caching is automatic too. So this is very different from Oracle for example, where you manage the “SGA” rather than letting the OS handle the caching.
    +1 for Greenplum

    This also means it is significantly easier to manage memory. You won’t be running to the init.ora file to manage the SGA but rather, you’ll let the OS handle the caching.
    +1 for Greenplum

    Eager Free Memory Policy
    The “gp_resqueue_memory_policy” setting is in the postgresql.conf file and it is set on the master only. You can also set it in your session, which may be helpful for large transformation statements.

    The default is “eager_free” and was introduced in Greenplum 4.2. It exploits the fact that all slices won’t be executing at the same time. It divides the query plan into stages and memory is distributed to each stage knowing that subsequent slices won’t execute until the blocking operators have completed. Leaving this the default value is typically the best idea.

    statement_mem
    If you have very large queries that need more memory, you can change statement_mem to use more memory rather than spilling to disk. You can see a query wanting more memory by looking at the explain plan of a query by using “explain analyze”. For example:

    EXPLAIN ANALYZE SELECT * FROM .....;
    

    The output will show the plan used but a key item to look for is “bytes wanted”. When you see this, it means that Greenplum had to spill to disk because there wasn’t enough memory available. The best approach is likely to rewrite the query. Alternatively, you can increase the amount of memory available.

    The maximum value you can set for statement_mem is determined by “max_statement_mem”. The default max_statement_mem is 2000MB.

    On the Master, execute the following to increase the statement_mem:

    gpconfig -c max_statement_mem -v 16GB
    gpstop -u
    

    Now, you can change the memory setting in your session. You can also do this with gpconfig to make the setting for all sessions.

    set statement_mem = '8GB';
    

    Re-run your query and see if it executes faster and if it still has “bytes wanted” in the query plan.

    Compressed Work Files
    If you know you are spilling to disk when executing queries because EXPLAIN ANALYZE showed that more bytes were wanted than available, you can trade CPU for IO by compressing the work files. This is done with “gp_workfile_compress_algorithm”. The default value is “none” but you can change this to “zlib”. It can be done at the session level or with gpconfig to make it system wide.
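
    For example, following the same pattern as statement_mem above, you could try it in your session first:

    set gp_workfile_compress_algorithm = 'zlib';

    Or make it system wide with gpconfig:

    gpconfig -c gp_workfile_compress_algorithm -v zlib
    gpstop -u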

    Temporary Tables
    Another way to deal with very large queries that spill to disk is to use temporary tables that are compressed. This is ideal when you use a subquery that is then joined to other tables. If you know it is spilling to disk (again from EXPLAIN ANALYZE showing more bytes wanted than available), you can populate a compressed temporary table instead. For example:

    CREATE TEMPORARY TABLE foo
    (myid int, bar text)
    WITH (APPENDONLY=true, COMPRESSTYPE=quicklz) ON COMMIT DROP 
    DISTRIBUTED BY (myid);
    

    Summary
    Memory in Greenplum is pretty easy to manage because the database has been designed to leverage the OS caching. The default Eager Free Memory Policy and the default statement_mem work very well for almost all queries in the database. However, if you do see queries that still need more memory than is available, you can increase statement_mem. If you are still spilling to disk because your statement needs more memory, you can have Greenplum automatically compress work files and/or use compressed temporary tables.

    Outsourcer 4.0.1

    Outsourcer 4.0.1 is mainly bug fixes and enhancements for Oracle.

    Changed the default Oracle Fetch Size from 40,000 to 2000. 2000 seems to be the point of diminishing returns for any greater values. The Oracle default is only 10, which makes exporting data very slow. The higher the value, the more memory is needed and the faster the export goes, but a fetch size of more than 2000 doesn’t improve performance and does consume more memory.

    Separated the Greenplum JDBC driver into a separate JAR file so that it is possible to upgrade just this driver.

    Separated the User Interface classes into a separate JAR file so it needs less memory.

    Separated the Scheduler classes into a separate JAR file so it needs less memory.

    Separated the NanoHttpd classes into a separate JAR file so it can be upgraded independently of Outsourcer.

    Fixed a bug in the UI where the SQL Text Box wasn’t visible when creating a Transform Job.

    Fixed a bug in the UI where quotes weren’t escaped properly for SQL entered into the SQL Text Box.

    Download

    Documentation

    Outsourcer 4.0

    PivotalGuru.com is pleased to announce Outsourcer 4.0!

    Overview Video

    New Features

  • Web based User Interface
  • Recurring Job Scheduling
  • Transform Jobs execution like other job types
  • Transform SQL added to every job type to execute last
  • Dependent Jobs are now supported
  • Dynamic Environment Variables
  • Check out the new Administrator Guide for more information.

    Link to Outsourcer 4.0.

    Checking for “Skew”

    Shared Nothing / MPP Database Architecture
    Greenplum database is a “Shared Nothing” database in which data is spread out across many segments located on multiple segment hosts. No two segments have the same data and data is distributed based on a distribution key.

    For example, let’s say you have a customer table in Greenplum with the distribution key of (customer_id). Your Greenplum cluster has 16 segments so your data would actually be physically located like this:

    Segment | Customer ID
    --------|------------
    0       | 1
    1       | 2
    2       | 3
    3       | 4
    4       | 5
    5       | 6
    6       | 7
    7       | 8
    8       | 9
    9       | 10
    10      | 11
    11      | 12
    12      | 13
    13      | 14
    14      | 15
    15      | 16

    When a query is executed like “SELECT COUNT(*) FROM customer”, there will be 16 processes created (one for each segment) where the number of rows are counted in parallel across all segments. This is the design behind a Shared Nothing or Massively Parallel Processing (MPP) database.

    Skew
    So what is skew? In my customer table example, I have even distribution. That is, each segment has the same number of rows. If I distributed instead by state_code and all of my customers are in Tennessee, then all of the data would be in a single segment and thus be skewed.

    Having tables that are skewed will slow down processing significantly because it isn’t taking advantage of the entire cluster.

    Do I need to Check For Skew?
    If you set your distribution key to the logical or physical Primary Key of the table, checking for skew isn’t needed. This will give you great distribution of data.

    There are times when you don’t have a PK on a table or you want to set the distribution key to something else because you commonly join two very large tables together. For example, customer and orders tables may perform better if distributed by customer_id so that the data is collocated. However, if the majority of your orders are by one customer, you will have skew in the orders table.
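
    For example, distributing both tables on the join key keeps matching rows on the same segments; a hypothetical pair of tables just to illustrate collocation:

    CREATE TABLE customer (customer_id int, customer_name text)
    DISTRIBUTED BY (customer_id);

    CREATE TABLE orders (order_id int, customer_id int, order_date date)
    DISTRIBUTED BY (customer_id);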

    Checking for Skew Method 1
    This is the most common and straightforward way to check for skew.

    SELECT gp_segment_id, COUNT(*)
    FROM customer
    GROUP BY gp_segment_id
    ORDER BY 1;
    

    This will count the rows per segment and give you a good indication of the skew. The downside is that it physically counts the rows, one table at a time, which takes time if you have a large number of tables.

    Checking for Skew Method 2
    I developed this second approach which looks at the file sizes for each table for each segment. It then will output only the tables that have at least one segment with more than 20% more bytes than expected.

    Execute the following script in your database.

    CREATE OR REPLACE FUNCTION fn_create_db_files() RETURNS void AS
    $$
    DECLARE 
            v_function_name text := 'fn_create_db_files';
            v_location int;
            v_sql text;
            v_db_oid text;
            v_num_segments numeric;
            v_skew_amount numeric;
            v_version text;
    BEGIN
            v_location := 1000;
            SELECT oid INTO v_db_oid 
            FROM pg_database 
            WHERE datname = current_database();
    
            v_location := 2000;
            v_sql := 'DROP VIEW IF EXISTS vw_file_skew';
    
            v_location := 2100;
            EXECUTE v_sql;
            
            v_location := 2200;
            v_sql := 'DROP EXTERNAL TABLE IF EXISTS db_files';
    
            v_location := 2300;
            EXECUTE v_sql;
    
            v_location := 3000;
            v_sql := 'CREATE EXTERNAL WEB TABLE db_files ' ||
                    '(segment_id int, relfilenode text, filename text, ' ||
                    'size numeric) ' ||
                    'execute E''ls -l $GP_SEG_DATADIR/base/' || v_db_oid || 
                    ' | ' ||
                    'grep gpadmin | ' ||
                    E'awk {''''print ENVIRON["GP_SEGMENT_ID"] "\\t" $9 "\\t" ' ||
                    'ENVIRON["GP_SEG_DATADIR"] "/' || v_db_oid || 
                    E'/" $9 "\\t" $5''''}'' on all ' || 'format ''text''';
    
            v_location := 3100;
            EXECUTE v_sql;
    
            v_location := 4000;
            SELECT count(*) INTO v_num_segments 
            FROM gp_segment_configuration 
            WHERE preferred_role = 'p' 
            AND content >= 0;
    
            v_location := 4100;
            v_skew_amount := 1.2*(1/v_num_segments);
            
            v_location := 4200;
            SELECT CASE WHEN POSITION('Greenplum Database 5' in v) > 0 THEN 'GPDB5'
                     WHEN POSITION('Greenplum Database 4' in v) > 0 THEN 'GPDB4'
                     ELSE 'other' END INTO v_version
            FROM version() AS v;
    
            v_sql := 'CREATE OR REPLACE VIEW vw_file_skew AS ' ||
                     'SELECT schema_name, ' ||
                     'table_name, ' ||
                     'max(size)/sum(size) as largest_segment_percentage, ' ||
                     'sum(size) as total_size ' ||
                     'FROM        ( ' ||
                     'SELECT n.nspname AS schema_name, ' ||
                     '      c.relname AS table_name, ' ||
                     '      sum(db.size) as size ' ||
                     '      FROM db_files db ';
    
            IF v_version = 'GPDB4' THEN
                     v_sql := v_sql || '      JOIN pg_class c ON ';
            ELSE
                     v_sql := v_sql || '      JOIN gp_dist_random(''pg_class'') c ON ';
            END IF;
    
            v_sql := v_sql || '      (split_part(db.relfilenode, ''.'', 1))::text = c.relfilenode::text ' ||
                     '      JOIN pg_namespace n ON c.relnamespace = n.oid ' ||
                     '      WHERE c.relkind = ''r'' and c.relstorage not in (''x'', ''v'', ''f'')' ||
                     '      GROUP BY n.nspname, c.relname, db.segment_id ' ||
                     ') as sub ' ||
                     'GROUP BY schema_name, table_name ' ||
                     'HAVING sum(size) > 0 and max(size)/sum(size) > ' || 
                     v_skew_amount::text || ' ' || 
                     'ORDER BY largest_segment_percentage DESC, schema_name, ' ||
                     'table_name';
    
            v_location := 4300;
            EXECUTE v_sql; 
    
    EXCEPTION
            WHEN OTHERS THEN
                    RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
    END;
    $$
    language plpgsql;
    

    Now execute the function so it will create the necessary objects in your database.

    SELECT fn_create_db_files();
    

    Now select from the view to see the tables that are skewed.

    SELECT * FROM vw_file_skew ORDER BY 3 DESC;
    

    Summary
    When tuning a database, a Greenplum DBA typically only needs to worry about skew rather than tuning individual queries as with other databases. Once the tables are created with good distribution keys, the tuning job is 90% or more done. If you think one particular table is not distributed well, use Method 1. If you need to validate an entire database quickly, use Method 2.

    Note: Edited to support GPDB5.
    Note: Edited again to fix for GPDB5.

    Autonomous Transactions for Logging v2

    Here is another way that is fault tolerant to a segment host failure.

    Step 1

    DROP EXTERNAL TABLE IF EXISTS error_log;
    
    CREATE WRITABLE EXTERNAL WEB TABLE error_log
    (
    function_name text,
    location int,
    error_message text
    )
    EXECUTE '/data/log_error.sh'
    FORMAT 'CSV' DISTRIBUTED RANDOMLY;
    
    

    Step 2
    Create /data/log_error.sh on every segment host and be sure to chmod 755 the file once created.

    #!/bin/bash
    # Read the CSV row that Greenplum pipes to this script via the writable
    # external web table, split it into its three fields, and insert the
    # result into the my_errors table on the master host (mdw).
    read in

    function_name=$(echo $in | awk '{split($0,arr,",")} END{print arr[1]}')
    location=$(echo $in | awk '{split($0,arr,",")} END{print arr[2]}')
    error_message=$(echo $in | awk '{split($0,arr,",")} END{print arr[3]}')

    psql -h mdw -c "INSERT INTO my_errors values ('$function_name', $location, '$error_message')"
    

    Note: I’ll admit that there is probably a more efficient way to parse $in but this works.

    Step 3
    Create the my_errors table.

    DROP TABLE IF EXISTS my_errors;
    
    CREATE TABLE my_errors
    (
    function_name text,
    location int,
    error_message text
    )
    DISTRIBUTED RANDOMLY;
    

    Step 4
    Insert some data into the error_log External Web Table; the write isn’t influenced by the transaction.

    BEGIN;
    
    INSERT INTO error_log VALUES ('fn_build_mart', 7500, 'ERROR:  division by zero');
    
    ROLLBACK;
    

    Step 5
    Verify that the data landed in the my_errors table.

    SELECT * FROM my_errors;
    

    Dimensional Modeling

    I got an email asking me what my thoughts are on dimensional modeling so I thought it would be better to make a post about this rather than an email. Thanks Allan for the inspiration.

    First, I think it would be very wise to read books from Ralph Kimball and Bill Inmon to understand the approach to data modeling for the Enterprise Data Warehouse. Dimensional modeling can be good if not great but it does have drawbacks. First, let’s talk about what it is.

    Overview
    Dimensional modeling is a data model that is broken up into Dimensions and Facts. Dimensions have hierarchies like Country, Region, State, and City while Facts have measures like quantity, price, cost, and score. This approach makes it simple for a BI tool to take a person’s input in a GUI and translate it into SQL that is reasonably certain to return the correct result. If you have a good dimensional model, you can switch BI Tools rather easily too.

    Building the Model
    How you build the model is argued by both Inmon and Kimball fans. One approach is to take various disparate data models from source databases and merge this into a common third normal form (3NF) structure. Then you can create dimensional models “easily” based on the various customers’ needs. The other approach is to skip the 3NF model and build sets of dimensional models with conforming dimensions for the entire enterprise.

    Kimball
    I have found that the Kimball approach works better for the projects I’ve worked on. He does a great job in his books to stress the importance of having a strong sponsor so that IT doesn’t build a dimensional model they think the business wants. That is great advice. The problem with this is the “loser” of any dispute on how the model should be built won’t use the model. Terms are frequently overloaded in organizations and mean different things to different people. Getting everyone to agree on virtually every term in an organization is not usually feasible. So you end up with a model that only one group uses and it took too long to build because of the interviews and discussions.

    Inmon
    This approach builds a 3NF structure that merges the disparate data into a single model. So you acquire a new system or company and then that new data has to fit into this model. How long does that take to do? How much rework is there every time there is something that you hadn’t thought of?

    Golden Triangle
    A basic concept in project management is the Golden Triangle. It simply refers to the three variables in any project: time, resources, and deliverables. You can’t control all three; you can only control two, so you have to give on one of them.

    Most likely you have a fixed set of resources, say 3 developers. You also have a set deliverable: build a dimensional model. The last variable is time, and you can’t control it unless you either add more developers or deliver less of the model. This is key to understanding the big flaw of a dimensional model: it takes too long to build.

    So what do the business users do while they wait for you to build their data model?

    Time to Market
    Time is everything in organizations. How fast can I get to my data? How fast can my queries execute? Making business users wait for you to build a model eats up time which equates to money. So why don’t you get out of the way?

    My approach
    I think most business users are smart and I think they know SQL. I’ve never been in an organization that doesn’t have SQL savvy business users. So if they are smart and know how to program enough to get their job done, why don’t you just give them the data?

    I think the best model is a simple one. Just replicate the transactional systems into a large database like Greenplum and even Pivotal Hadoop now that it has HAWQ. Business users will query the data and then create their own de-normalized tables. You got out of their way and they got their job done much faster than if you created a dimensional model for them. You might build a dimensional model to make it easier to use with some BI tools after they have access to the data.
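
    As a rough sketch of what that looks like in practice (the table and column names here are hypothetical), a business user can build a de-normalized table straight from the replicated source tables with a single CREATE TABLE AS statement:

    CREATE TABLE sales_flat
    WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz) AS
    SELECT o.order_id, o.order_date, c.customer_name, p.product_name, o.amount
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    JOIN product p ON p.product_id = o.product_id
    DISTRIBUTED BY (order_id);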

    Most databases crumble at the thought of letting users execute any “bad” query they want. That is because most databases are designed for Online Transaction Processing (OLTP) and not big data. Look at my Tuning Guide for Greenplum to see how easy it is to maintain a rather robust database platform.

    Direction of Industry
    More and more organizations are embracing this approach. With Pivotal Hadoop or Greenplum database, you load all of the data into one cluster that scales easily, get great insights out of the data, have excellent time to market, have little to no development costs, and have very little administrator costs.

    Businesses can’t wait for long running queries and they certainly can’t wait for you to build a dimensional model.

    Autonomous Transactions for Logging

    Updated July 17, 2013
    ***************
    1. External Web Table to read data should run on HOST, not MASTER.
    2. The /data/log_error.sh file needs to be created on every segment host.
    3. Need to chmod 755 the .sh file on the segment hosts.
    4. Need to touch /data/error.log on all of the segment hosts because the file needs to exist to prevent an error reading from an empty table.
    ***************

    An Autonomous Transaction is not a feature of Greenplum database but it can be achieved with some of the features unique to Greenplum. If you aren’t familiar with an Autonomous Transaction, it is basically a new transaction that is created within a transaction but it isn’t tied to your first transaction.

    What is the point? Well, in database programming, you may run into an EXCEPTION that you want to RAISE to the calling program but you also want to log the error. Raising an ERROR in Greenplum rolls back your transaction so the INSERT to a log table also gets rolled back.

    Here is a way to create a log table that enables you to RAISE an ERROR in your code but the INSERT doesn’t get rolled back!

    First create the error_log table:

    DROP EXTERNAL TABLE IF EXISTS error_log;
    
    CREATE WRITABLE EXTERNAL WEB TABLE error_log
    (
    function_name text,
    location int,
    error_message text
    )
    EXECUTE '/data/log_error.sh'
    FORMAT 'CSV' DISTRIBUTED RANDOMLY;
    

    Next create the /data/log_error.sh file:

    #!/bin/bash
    # Append the row that Greenplum pipes in to a local log file on the segment host.
    read in
    echo $in >> /data/error.log
    

    Now you can read this error.log file if you want or you can create another EXTERNAL TABLE to read the data like this:

    DROP EXTERNAL TABLE IF EXISTS my_errors;
    
    CREATE EXTERNAL WEB TABLE my_errors
    (
    function_name text,
    location int,
    error_message text
    )
    EXECUTE 'cat /data/error.log' ON /*MASTER*/ HOST
    FORMAT 'CSV';
    

    So now to test this out:

    gpdb=# BEGIN;
    BEGIN
    gpdb=# INSERT INTO error_log VALUES ('fn_build_mart', 7500, 'ERROR:  division by zero');
    INSERT 0 1
    gpdb=# ROLLBACK;
    ROLLBACK
    gpdb=# SELECT * FROM my_errors;
     function_name | location |      error_message      
    ---------------+----------+-------------------------
     fn_build_mart |     7500 | ERROR: division by zero
    (1 row)
    
    

    So even though I rolled back my transaction, the INSERT into the WRITABLE EXTERNAL WEB TABLE in Greenplum didn’t get rolled back and thus, you have created an Autonomous Transaction in Greenplum.
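
    Putting it to use, here is a hedged sketch (the function name and the failing statement are made up) of a PL/pgSQL function that logs to error_log in its EXCEPTION block before re-raising, so the error is recorded even though everything else rolls back:

    CREATE OR REPLACE FUNCTION fn_build_mart() RETURNS void AS
    $$
    DECLARE
        v_location int;
    BEGIN
        v_location := 7500;
        -- the real work would go here; this statement fails on purpose
        PERFORM 1/0;
    EXCEPTION
        WHEN OTHERS THEN
            -- the write goes through the external web table, so it is not rolled back
            INSERT INTO error_log VALUES ('fn_build_mart', v_location, sqlerrm);
            RAISE EXCEPTION '(%:%) %', 'fn_build_mart', v_location, sqlerrm;
    END;
    $$
    LANGUAGE plpgsql;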

    Outsourcer 3.1 and New Name!

    Outsourcer 3.1 is now available for download on the Installation page. Updated Documentation is also available on the Documentation page.

    What’s new?
    Oracle

    • FetchSize is now configurable. To minimize round trips to Oracle, make this setting rather large (greater than 10,000) which increases exporting speed but at the expense of needing more RAM. Adjust this setting up or down based on your environment. Default is 40,000.

    SQL Server

    • Fix to better handle non-default schemas

    DDL Refresh Type

    • Several customers have requested that Outsourcer just create the tables in Greenplum based on the DDL in the source without loading the tables. This new Refresh Type does just that.

    New Name: PivotalGuru.com

    Greenplum is now part of Pivotal and to follow this name change, I now have the PivotalGuru.com site!

    Pivotal is an exciting and rather large startup company with investments from EMC, VMWare, and GE. I am proud to be part of this new company and very excited about the new offerings in Big Data and Cloud Computing that Pivotal will bring to market.

    Hadoop Data Lake and Transforming Data

    A Data Lake is a term that describes a large enterprise repository of data stored in Hadoop. More and more companies are concluding that a Data Lake is the right solution over a traditional ETL process and a restricted EDW. The Data Lake is inexpensive, scales easily, uses commodity hardware, provides a very flexible schema, and enables an easy way to transform data in parallel.

    Schema
    So why not use a relational database for this Data Lake? In a relational database, the schema is defined first and then data is forced into it. With Hadoop, you first load the data and then apply a schema as you read it out. This means adding new data to Hadoop is easier and faster because you don’t have to first define the schema.

    Parallel
    Processing is also in parallel. You can transform the data using Hadoop tools like Pig to then be loaded into a relational data store or just use it in Hadoop.

    Greenplum Database External Tables
    There are use cases where a relational database like Greenplum database is easier to use and performs better than Hadoop. A great feature of Greenplum database is the ability to create an External Table to Hadoop. These External Tables can be defined to either READ or WRITE data to Hadoop. Because Greenplum is an MPP database, each segment connects to Hadoop to READ/WRITE data. This makes the integration of Greenplum Database with Hadoop much faster than a single threaded approach that you might see with other database products.
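
    Here is a hedged sketch of the WRITE direction (the host, path, and source table are assumptions for illustration): a writable external table pointed at HDFS lets every segment write its rows out in parallel.

    CREATE WRITABLE EXTERNAL TABLE ext_sales_out
    (trans_id int,
     amount decimal(9,2),
     region text)
    LOCATION ('gphdfs://localhost:8020/enterprise/sales_out')
    FORMAT 'TEXT' (DELIMITER '|')
    DISTRIBUTED RANDOMLY;

    -- each segment writes its portion of the result directly to HDFS
    INSERT INTO ext_sales_out SELECT trans_id, amount, region FROM sales;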

    Transform Example
    So let’s say you get an employee file from a company you just acquired. You want to take this file and then make it available to the Enterprise in Hadoop as well as a data mart in Greenplum database.

    We will first load the data into a stage directory, run a program to remove a control character, and then put it in the enterprise directory.

    hadoop dfs -mkdir /stage
    hadoop dfs -mkdir /enterprise
    

    Next, I use a Pig program to remove a control character. In this example, I’m removing the “null character”.

    The name of this file is pig_cleaning.pig.

    dirty = load '$STAGE' as (data:chararray);
    clean = foreach dirty generate REPLACE(data, '\\u0000', '');
    store clean into '$TARGET';
    

    Here is my employees.txt file I created. Notice the ^@ symbols. These are control characters as they appear when using vi. The pig_cleaning.pig script will remove these control characters.

    Jon|Roberts|123 Main Street|New York|NY|10202|Sales Engineer|NorthEast
    Abby|Cromwell|77 Juniper Drive|Raleigh|NC|27605|CIO|NorthWest
    Lilly|Vargas|7894 Rayna Rd|Atlanta|GA|30301|VP Sales|SouthEast
    Emily|Palmer|224 Warren Drive|St. Louis|MO|63101|VP Marketing|MidWest
    Winston|Rigby|84 West Road|^@|CA|^@|President|West
    

    Now I created a shell script that accepts a filename to load and the name of the directory in Hadoop to store the results in.

    The name of this file is load.sh.

    #!/bin/bash
    
    # $1 is the filename
    # $2 is the target directory name
    hadoop dfs -put $1 /stage
    pig -param STAGE=/stage/$1 -param TARGET=/enterprise/$2 pig_cleaning.pig
    hadoop dfs -rm /stage/$1
    

    This script loads the file into the /stage directory, runs the pig program to clean the file of the null character (^@), stores the output to the /enterprise directory, and then removes the stage file.

    Executing the script is as easy as:

    ./load.sh employees.txt employees
    

    Now what about Greenplum database? Here is how you can READ that data in Hadoop from the database. Note that in this example, I have Hadoop and Greenplum database on the same single host. Typically, these will be on separate hosts and instead of localhost, you would have the name of the NameNode like hdm1.

    create schema enterprise;
    
    create external table enterprise.employees
    (fname text,
     lname text,
     address1 text,
     city text,
     state text,
     zip text,
     job text,
     region text)
     LOCATION ('gphdfs://localhost:8020/enterprise/employees/part*')
     FORMAT 'TEXT' (delimiter '|');
    
    

    And now let’s execute a SELECT statement.

    gpdb=# select * from enterprise.employees;
      fname  |  lname   |     address1     |   city    | state |  zip  |      job       |  region   
    ---------+----------+------------------+-----------+-------+-------+----------------+-----------
     Jon     | Roberts  | 123 Main Street  | New York  | NY    | 10202 | Sales Engineer | NorthEast
     Abby    | Cromwell | 77 Juniper Drive | Raleigh   | NC    | 27605 | CIO            | NorthWest
     Lilly   | Vargas   | 7894 Rayna Rd    | Atlanta   | GA    | 30301 | VP Sales       | SouthEast
     Emily   | Palmer   | 224 Warren Drive | St. Louis | MO    | 63101 | VP Marketing   | MidWest
     Winston | Rigby    | 84 West Road     |           | CA    |       | President      | West
    (5 rows)
    
    

    Conclusion
    Hadoop is being used by Enterprises to create a Data Lake. Once there, it is fast and easy to transform the data. And with Greenplum database, it is easy to use SQL tools to access the data.

    Sending Emails

    A customer recently asked me how can they send an email from the command line to alert the DBAs when a backup starts or finishes. It can also be used to notify when something fails. So, here is my Python script that sends an email. You have to specify the SMTP server and this version assumes the SMTP doesn’t need credentials to send an email.

    #!/usr/bin/env python
    import sys, smtplib, string, getopt
    
    def send_my_mail (email_smtp, email_from, email_to, email_subject, email_body):
    
      body = string.join((
        "From: %s" % email_from,
        "To: %s" % email_to,
        "Subject: %s" % email_subject,
        "",
        email_body
      ), "\r\n")
    
      s = smtplib.SMTP(email_smtp)
      s.sendmail(email_from, email_to, body)
      s.quit()
    
    def main(argv):
       email_from = ''
       email_to = ''
       email_subject = ''
       email_body = ''
       email_smtp = ''
       if len(sys.argv)<11:
              print 'mail.py -f <from> -t <to> -s <subject> -b <body> -d <smtp server>'
              sys.exit(2)
       try:
          opts, args = getopt.getopt(argv,"hf:t:d:b:s:")
       except getopt.GetoptError:
              print 'mail.py -f <from> -t <to> -s <subject> -b <body> -d <smtp server>'
              sys.exit(2)
       for opt, arg in opts:
          if opt == '-h':
              print 'mail.py -f <from> -t <to> -s <subject> -b <body> -d <smtp server>'
              sys.exit()
          elif opt in ("-f"):
             email_from = arg
          elif opt in ("-t"):
             email_to = arg
          elif opt in ("-d"):
             email_smtp = arg
          elif opt in ("-b"):
             email_body = arg
          elif opt in ("-s"):
             email_subject = arg
    
       send_my_mail (email_smtp, email_from, email_to, email_subject, email_body)
    
    if __name__ == "__main__":
       main(sys.argv[1:])
    
    

    Most of the code is actually parsing the command line arguments. Wouldn’t it be nice to instead use a function in the database that doesn’t need to parse command line arguments? That is exactly what I did! I next created a function in Greenplum that sends an email too!

    First you need to install plpythonu if you haven’t already.

     CREATE PROCEDURAL LANGUAGE 'plpythonu'
      HANDLER plpython_call_handler;
    

    Next, you create the function:

    create or replace function fn_send_email(email_smtp text, email_from text, email_to text, email_subject text, email_body text) returns void as 
    $$
      import smtplib, string
      body = string.join(("From: %s" % email_from, "To: %s" % email_to, "Subject: %s" % email_subject, "", email_body), "\r\n")
      s = smtplib.SMTP(email_smtp)
      s.sendmail(email_from, email_to, body)
      s.quit()
    $$ 
    language plpythonu;
    
    

    All of that code around parsing the command line is gone. It is very simple and easy to understand too.

    Below is the function being used. Note: I used fake values for the SMTP server, email from, and email to.

    select fn_send_email('smtp_example.email.com', 'email_from@email.com', 'email_to@email.com', 'test subject', 'This is the body of my test message');
    
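    And since the original request was to be notified when something fails, here is a hedged sketch (the function name and body are made up) of calling fn_send_email from the EXCEPTION block of a PL/pgSQL function:

    CREATE OR REPLACE FUNCTION fn_nightly_load() RETURNS void AS
    $$
    BEGIN
        -- the real ETL work would go here; this fails on purpose
        PERFORM 1/0;
    EXCEPTION
        WHEN OTHERS THEN
            -- email the DBAs, then re-raise so the caller still sees the failure
            PERFORM fn_send_email('smtp_example.email.com', 'email_from@email.com',
                                  'email_to@email.com', 'fn_nightly_load failed', sqlerrm);
            RAISE EXCEPTION '%', sqlerrm;
    END;
    $$
    LANGUAGE plpgsql;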

    How Do I Use Java In Greenplum?

    This is the sixth in a series of “How do I do X in Greenplum database” and today’s topic is How do I use Java in Greenplum.

    Greenplum supports many languages and one of these is Java. This is defined as a language called “pljava”. Because it is Java, it still requires you to compile the code and put it in a JAR file that is in your classpath. There are a couple of additional steps to configure this in Greenplum so I will go through them.

    Prerequisites:
    1. JDK installed either on the master or a desktop. If on a desktop, you’ll need to copy the jar file to the master so you can then copy it to the segments.
    2. Logged into the master (mdw) as user gpadmin.
    3. Methods must be public and static.

    First step, write some Java and here is an example of a file named Example.java:

    public class Example
    {
            public static String substring(String text, int beginIndex, int endIndex)
            {
                    return text.substring(beginIndex, endIndex);
            }
    }
    

    Create a manifest file named manifest.txt:

    Manifest-Version: 1.0
    Main-Class: Example
    Specification-Title: "Example"
    Specification-Version: "1.0"
    Created-By: 1.6.0_35-b10-428-11M3811
    Build-Date: 09/28/2012 10:09 AM
    

    Compile the Java:

    javac *.java
    

    Jar the file:

    jar cfm analytics.jar manifest.txt *.class
    

    Copy the analytics.jar to all Greenplum servers. The gphosts_file contains a list of all of your servers. An example of that is:

    mdw
    smdw
    sdw1
    sdw2
    sdw3
    sdw4
    

    And the command to copy the Jar file:

    gpscp -f gphosts_file analytics.jar =:/usr/local/greenplum-db/lib/postgresql/java/
    

    Set your classpath variable inside Greenplum:

    gpconfig -c pljava_classpath -v \'analytics.jar\'
    

    Apply the configuration change:

    gpstop -u
    

    Install the pljava language and in this example, I’m putting it into the gpdb database. Note: this is a one time activity per database.

    psql gpdb -f $GPHOME/share/postgresql/pljava/install.sql
    

    Make sure the classpath is set correctly:

    show pljava_classpath
    

    It should say ‘analytics.jar’.

    And finally, a working example:

    create table temp (a varchar) distributed randomly;
    insert into temp values ('a string');
    
    --Example function
    create or replace function java_substring(varchar, int, int)
    returns varchar as 'Example.substring' language java;
    
    --Example execution
    select java_substring(a, 1, 3) from temp;
    
    

    This was a pretty simple example of using PL/Java. Enjoy!

    How Do I Prevent Blocking Locks in Greenplum?

    This is the fifth in a series of “How do I do X in Greenplum database” and today’s topic is How do I prevent “blocking locks” in Greenplum?

    Since different databases handle read consistency and locking differently, I will discuss how Oracle, Microsoft SQL Server, PostgreSQL, and finally Greenplum handle locking. This will give you a good way to compare and contrast the different products.

    Oracle
    Oracle handles read consistency by using “rollback”. When you update a table, the row is overwritten, but first the old version of the data is put into a rollback segment and the row is marked with a timestamp and a pointer to the old version. This allows users to query the database and not be blocked while an update happens.

    If a query started at time1 and an update happens at time2, the query will see a snapshot of the data as of time1. To make this work, the query follows the pointer to the rollback segment to get the old version of the rows that were updated.

    Oracle Problems
    There are a couple of problems with this solution. The first is Disk IO because the database has to first read the data file and then read the rollback segment.

    The second problem with this is the rollback segment must be sized pretty large in a warehouse environment when you can easily have a query execute for a long period of time. You will eventually run out of rollback and get the “snapshot too old” error on long running queries if you are actively changing data.

    SQL Server
    SQL Server uses locking by default to provide read consistency. This means an update or even an insert can block other sessions from reading a table. Yuck. Starting with SQL Server 2005, you can enable Read Committed Snapshot Isolation (RCSI) which makes the database behave more like Oracle but it uses TempDB instead of rollback for the old version of the rows updated.

    SQL Server Problems
    To get around locking, developers will use “read uncommitted” which is a dirty read. I never use a dirty read because it is a hack. Don’t do it.

    Using RCSI uses TempDB so more IO is involved. Unfortunately, I also almost never see it used in companies. Most SQL Server shops don’t even know this option is available so they continue with their dirty reads.

    PostgreSQL
    PostgreSQL uses Multi-Version Concurrency Control (MVCC) to handle read consistency. In a nutshell, readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.

    To a user, PostgreSQL behaves similarly to Oracle but it doesn’t use rollback. Instead, an update adds a new row version to the existing table. Because PostgreSQL doesn’t use rollback, you will never get a “snapshot too old” error on long running queries.

    PostgreSQL Problems
    The downside to PostgreSQL happens when an update affects a large percentage of a table. If you update an entire table, it effectively doubles the size on disk. To make the old rows available again, a “vacuum” will need to be run so that new updates and/or inserts will use the deleted rows. PostgreSQL uses a background daemon to automatically vacuum tables so this usually isn’t a big deal.

    Greenplum
    Greenplum is a fork of PostgreSQL 8.2 so it uses MVCC to handle read consistency. Readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.

    Greenplum MVCC Exception
    Greenplum differs from PostgreSQL for update and delete commands. It acquires an Exclusive Lock on a table. However, this lock doesn’t block readers. SQL queries will return data and users will see a snapshot of the data that isn’t dirty. This Exclusive Lock does however block other update and delete commands which is different from PostgreSQL.
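
    You can watch this happen with a query against the pg_locks catalog. A hedged sketch (the table name is just an example): start an UPDATE on the table in one session, then run this from a second session.

    SELECT c.relname, l.mode, l.granted
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
    WHERE c.relname = 'customer';

    While the UPDATE is running, you should see an ExclusiveLock on the table, yet a plain SELECT from that table in the second session still returns rows.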

    Summary
    Greenplum Database provides read consistency with Multi-Version Concurrency Control (MVCC) similarly to PostgreSQL but update and delete operations are serialized. You don’t need to worry about blocking locks for selecting data because MVCC uses snapshots of the database instead of blocking locks. You can get a blocking lock on update and delete when another update or delete command is executing on the same table but readers aren’t blocked! You will likely never even notice this lock because the database will handle the updates and deletes one at a time while you can still select data.

    How Do I Tune Greenplum?

    This is the fourth in a series of “How do I do X in Greenplum database” and today’s topic is How do I tune Greenplum?

    I’ve been a DBA and Developer using Oracle, SQL Server, Sybase, PostgreSQL, and Greenplum. By far, “tuning” Greenplum is much easier than the other databases, and the main reason is the architecture of Greenplum versus an OLTP database. With those other databases, I always found myself looking at the top 20 worst SQL queries and then doing three things. First, tell the users that they write “bad” SQL. Second, I would create more and more indexes to handle the variety of queries. Lastly, I would redesign my dimensional model to handle how users really wanted to look at the data.

    Data Model
    Greenplum is data model agnostic. You can use your OLTP data model directly in the database or a 3NF model or a star schema or even a very flat single table. Some databases force you into a particular data model which isn’t agile at all. It even forces waterfall development because it takes a lot of time to design those star schemas!

    So now, my steps to tune Greenplum!

    Step 1: Distribution
    Greenplum is a shared nothing database where no two nodes share the same data. Data is spread across multiple servers based on a distribution key defined on each table. A good key is typically a unique identifier in a table, and this can be a single column or multiple columns. If you pick a good key, each segment will have roughly the same number of rows; at Greenplum we call this spread of data the “skew”.

    Here is how to check the distribution or skew of a table:

    SELECT gp_segment_id, count(*)
    FROM schema.table
    GROUP BY gp_segment_id;
    

    For most tables, this is all of the tuning that is needed. It is unusual that all of the tables in a database are very large. Usually there are a few large tables along with a large number of medium and small tables.

    Step 2: Partitioning
    For your larger tables, you will want to partition the tables to eliminate Disk IO. Greenplum supports list, range, and multi-level partitioning.

    CREATE TABLE sales (trans_id int, date date, amount decimal(9,2), region text) 
    DISTRIBUTED BY (trans_id)
    PARTITION BY RANGE (date)
    SUBPARTITION BY LIST (region)
    SUBPARTITION TEMPLATE (
        SUBPARTITION usa VALUES ('usa'), 
        SUBPARTITION asia VALUES ('asia'), 
        SUBPARTITION europe VALUES ('europe'),
        DEFAULT SUBPARTITION other_regions)
    (START (date '2008-01-01') INCLUSIVE 
    END (date '2009-01-01') EXCLUSIVE EVERY (INTERVAL '1 month'),
    DEFAULT PARTITION outlying_dates );
    

    The Greenplum optimizer will eliminate partitions based on the SQL executed. Again, partitioning is for your larger tables and won’t help very much for smaller tables.
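
    To see the elimination, a quick sketch using the sales table above: put a predicate on the partition column and the plan should show only the matching partitions being scanned.

    EXPLAIN
    SELECT region, sum(amount)
    FROM sales
    WHERE date >= '2008-03-01' AND date < '2008-04-01'
    GROUP BY region;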

    Step 3: Compression and Column Orientation
    Greenplum supports both Row and Column oriented tables (and partitions). For your larger tables and especially the ones with lots of columns, using Column orientation is a good idea. It is also a good idea to compress this. It is basically another strategy to reduce Disk IO.

    Here is an example:

    CREATE TABLE my_customer 
    WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz) AS
    SELECT customer_id, fname, lname,
    address1, address2, city, state, zip,
    create_date, bill_amount
    FROM customer
    WHERE type_cd = 'xyz'
    DISTRIBUTED BY (customer_id);
    

    I only do this for my large tables. It isn’t worth it to do this unless the tables are pretty large.

    Step 4: Indexes
    Last but not least are indexes. I rarely use an index except for enforcing a Primary Key. Greenplum supports B-Tree, Bitmap, and GiST indexes.

    – B-Tree indexes can be unique or not.
    – Bitmap indexes are useful when a column has 100 to 100,000 distinct values.
    – GiST indexes are used to support PostGIS.

    Again, I rarely use indexes but there are cases when it is helpful. It is the last in the line of things to do to “tune” your database.
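
    When an index is warranted, a hedged sketch might look like the following (it assumes a heap table named customer distributed by customer_id, which is not defined in the example above):

    -- the Primary Key creates a unique B-Tree index
    ALTER TABLE customer ADD PRIMARY KEY (customer_id);

    -- a Bitmap index on a lower-cardinality column
    CREATE INDEX ix_customer_state ON customer USING bitmap (state);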

    Summary
    Tuning in Greenplum doesn’t mean looking at lots of queries; it means checking table distribution. Then, for your larger tables (probably 10% or less of your database), Partition, Compress, and make them Column oriented. Then for maybe 1 or 2 tables, use an Index.

    Gone are the days of looking at the “bad” queries. No more creating indexes for everything under the sun. No more waterfall data modeling projects either! Just load the data and follow my four tuning steps.

    How Do I Determine The Language For A Function?

    This is the third in a series of “How do I do X in Greenplum database” and today’s topic is How do I determine the language for a function.

    Inline SQL
    A common way to use a function is to transform data directly in the SELECT portion of a query. Here is a quick example.

    CREATE TABLE person
    (id int NOT NULL,
     fname text,
     lname text,
     company_name text)
    DISTRIBUTED BY (id);
    
    --sample data
    INSERT INTO person VALUES (1, 'Jon', 'Roberts', null);
    INSERT INTO person VALUES (2, null, 'Smith', null);
    INSERT INTO person VALUES (3, null, null, 'ABC Company');
    
    

    Now the use case is to retrieve the “name” of the person but it might be a combination of the three text columns. A business decision was made to format the name as follows:

    SELECT CASE WHEN company_name IS NULL AND fname IS NOT NULL AND lname IS NOT NULL THEN lname || ', ' || fname
                WHEN fname IS NULL AND lname IS NOT NULL THEN lname
                WHEN company_name IS NOT NULL AND fname IS NOT NULL AND lname IS NOT NULL THEN company_name || ': ' || lname || ', ' || fname
                WHEN company_name IS NOT NULL AND fname IS NULL AND lname IS NOT NULL THEN company_name || ': ' || lname 
                WHEN company_name IS NOT NULL AND fname IS NULL AND lname IS  NULL THEN company_name END as name
    FROM person;
    
         name     
    --------------
     Roberts, Jon
     ABC Company
     Smith
    (3 rows)
    

    It would be much easier and more consistent to add this logic to a function so it centralizes the logic in a single place and makes writing the SQL a bit easier. This is when using the language “SQL” is preferred. Here is an example; notice that parameters are referred to as $1, $2, and $3 rather than by name. The SQL language does not support named variables.

    CREATE OR REPLACE function fn_get_name(p_fname text, p_lname text, p_company_name text) RETURNS text AS
    $$
    SELECT CASE WHEN $3 IS NULL AND $1 IS NOT NULL AND $2 IS NOT NULL THEN $2 || ', ' || $1
                WHEN $1 IS NULL AND $2 IS NOT NULL THEN $2
                WHEN $3 IS NOT NULL AND $1 IS NOT NULL AND $2 IS NOT NULL THEN $3 || ': ' || $2 || ', ' || $1
                WHEN $3 IS NOT NULL AND $1 IS NULL AND $2 IS NOT NULL THEN $3 || ': ' || $2 
                WHEN $3 IS NOT NULL AND $1 IS NULL AND $2 IS  NULL THEN $3 END as name
    $$
    LANGUAGE sql;
    

    And now using this function.

    SELECT fn_get_name(fname, lname, company_name) AS name FROM person;
         name     
    --------------
     Roberts, Jon
     ABC Company
     Smith
    (3 rows)
    

    Transformation
    The next use case is transforming data, and PL/pgSQL does a great job at this. You can reference parameters by name, declare variables, handle errors, etc. The code looks a lot like Oracle PL/SQL too.

    CREATE OR REPLACE FUNCTION fn_update_person() RETURNS void AS
    $$
    DECLARE
        v_rowcount int;
    BEGIN
        UPDATE person
        SET fname = initcap(fname),
            lname = initcap(lname),
            company_name = initcap(company_name);
    
        GET DIAGNOSTICS v_rowcount = ROW_COUNT;
    
        RAISE INFO '% Rows Updated.', v_rowcount;
    
    END;
    $$
    LANGUAGE plpgsql VOLATILE;
    

    Notice how I’m not looping through the data or issuing a commit after so many records. There are two reasons for this. First, Greenplum is designed for Big Data so manipulating lots of data in a single transaction is ideal. Secondly, a function in Greenplum is executed in a single transaction. You can’t create a transaction within a function because executing a function is by default in a transaction.

    Summary
    I hope this helps. When transforming data inline in a SELECT, use the SQL language; for more involved transformations, use the more robust PL/pgSQL language.

    How Do I Create Dynamic SQL in Greenplum Database?

    This is the second in a series of “How do I do X in Greenplum database” and today’s topic is How do I create dynamic SQL.

    Method 1: psql
    The psql program is a command line program that is similar to isql, sqlcmd, or sqlplus. You can use it to dynamically create SQL statements to execute and then pipe it to another instance of psql to execute each statement in a separate transaction.

    Example:
    In this example, I will create a script dynamically that will analyze each table in the retail schema.

    psql -A -t -c "SELECT 'ANALYZE ' || table_schema || '.' || table_name || ';' FROM information_schema.tables WHERE table_schema = 'retail'" | psql -a
    ANALYZE retail.d_customers;
    ANALYZE
    ANALYZE retail.d_payment_methods;
    ANALYZE
    ANALYZE retail.d_products;
    ANALYZE
    ANALYZE retail.f_sales;
    ANALYZE
    

    psql is executed from the shell with -A (unaligned table output mode), -t (print rows only), and -c (command to execute). The Dynamic SQL is generated and then executed by using a | and psql -a. The -a means that the commands generated by the first psql program are echoed and executed.

    This is a nice feature when you need to have each statement executed in a separate transaction.

    Method 2: PL/pgSQL
    The second way of doing this same task is in PL/pgSQL.

    CREATE OR REPLACE FUNCTION retail.fn_analyze() RETURNS void AS
    $$
    
    DECLARE
        v_function_name text := 'retail.fn_analyze';
        v_location int;
        v_rec record;
    
    BEGIN
        v_location := 1000;
        FOR v_rec IN SELECT 'analyze ' || table_schema || '.' || table_name AS sql
                     FROM information_schema.tables 
                     WHERE table_schema = 'retail' LOOP
    
            RAISE INFO '%: %', clock_timestamp(), v_rec.sql;        
            EXECUTE v_rec.sql;
    
        END LOOP;
        
    EXCEPTION
        WHEN OTHERS THEN
            RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
    END;
    
    $$
    LANGUAGE plpgsql;
    

    And the output of the execution.

    psql -c "select retail.fn_analyze()"
    INFO:  2012-09-11 11:21:45.580327-05: analyze retail.d_customers
    INFO:  2012-09-11 11:21:46.102631-05: analyze retail.d_payment_methods
    INFO:  2012-09-11 11:21:46.225238-05: analyze retail.d_products
    INFO:  2012-09-11 11:21:46.556235-05: analyze retail.f_sales
     fn_analyze 
    ------------
     
    (1 row)
    

    Note that all of the work done in a function is in a single transaction. If there is a problem and the function fails, all of the work done up to that point will be rolled back. So in the case of analyzing tables, the psql technique is probably a better choice but for other use cases, the function technique is better.

    How Do I Get The Number Of Rows Inserted?

    This is the first in a series of “How do I do X in Greenplum database” and today’s first topic is How do I get the number of rows Inserted (or Updated/Deleted).

    Knowing the number of rows Inserted, Updated, or Deleted is pretty common in database programming.  You may want to key off of this to determine the next step in a procedure or just capture the number of rows in an ETL process.

    Sybase and Microsoft SQL Server use a Global Variable to do this which is @@ROWCOUNT.  It is dynamically populated after each statement execution.  In Oracle, you can use SQL%ROWCOUNT to do the same thing.

    In Greenplum and in a PL/pgSQL function, you can get the rowcount by using GET DIAGNOSTICS.  Here is an example.

    CREATE OR REPLACE FUNCTION fn_update_person() RETURNS void AS
    $$
    DECLARE
        v_rowcount int;
    BEGIN
        UPDATE person
        SET state_code = UPPER(state_code);
    
        GET DIAGNOSTICS v_rowcount = ROW_COUNT;
    
        RAISE INFO '% Rows Updated.', v_rowcount;
    
    END;
    $$
    language plpgsql;
    

    There you go. Pretty simple to get the ROW_COUNT in Greenplum.

    Myth: COUNT(*) is slower than COUNT(1) or COUNT(column name)

    There is a strongly held myth by some that in Greenplum, COUNT(*) is somehow slower than COUNT(1) or COUNT(column name).  However, it is a myth and there isn’t a difference.  If you are interested in proof, keep reading.

    The 4 arguments I’ve heard that COUNT(*) is bad and COUNT(1) or COUNT(column name) is better are:
    1. COUNT(*) is slower
    2. COUNT(1) is faster when there is an index on a column
    3. COUNT(column name) is faster when you count a unique column
    4. COUNT(1) is faster when the table is column oriented

    1. COUNT(*) is slower

    Setup

    create schema demo;
    
    create table demo.foo
    (id int not null,
    fname text,
    lname text)
    distributed by (id);
    
    insert into demo.foo
    select i, 'jon' || i, 'roberts' || i
    from (select generate_series(1, 1000000) as i) as sub;
    
    analyze demo.foo;

    COUNT(*)

    explain analyze select count(*) from demo.foo;
    
    Aggregate  (cost=14228.92..14228.93 rows=1 width=8)
      Rows out:  1 rows with 483 ms to end, start offset by 0.474 ms.
      ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14228.86..14228.91 rows=1 width=8)
            Rows out:  2 rows at destination with 481 ms to first row, 483 ms to end, start offset by 0.486 ms.
            ->  Aggregate  (cost=14228.86..14228.88 rows=1 width=8)
                  Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 483 ms to end, start offset by 0.908 ms.
                  ->  Seq Scan on foo  (cost=0.00..11725.49 rows=500675 width=0)
                        Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.071 ms to first row, 398 ms to end, start offset by 0.908 ms.
    Slice statistics:
      (slice0)    Executor memory: 152K bytes.
      (slice1)    Executor memory: 200K bytes avg x 2 workers, 200K bytes max (seg0).
    Statement statistics:
      Memory used: 128000K bytes
    Total runtime: 484.379 ms

    COUNT(1)

    explain analyze select count(1) from demo.foo;
    
    Aggregate  (cost=14228.92..14228.93 rows=1 width=8)
      Rows out:  1 rows with 504 ms to end, start offset by 0.532 ms.
      ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14228.86..14228.91 rows=1 width=8)
            Rows out:  2 rows at destination with 502 ms to first row, 504 ms to end, start offset by 0.564 ms.
            ->  Aggregate  (cost=14228.86..14228.88 rows=1 width=8)
                  Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 499 ms to end, start offset by 4.029 ms.
                  ->  Seq Scan on foo  (cost=0.00..11725.49 rows=500675 width=0)
                        Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.030 ms to first row, 398 ms to end, start offset by 4.029 ms.
    Slice statistics:
      (slice0)    Executor memory: 152K bytes.
      (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
    Statement statistics:
      Memory used: 128000K bytes
    Total runtime: 505.237 ms

    Conclusion: No difference.

    2. COUNT(1) is faster when there is an index on a column

    COUNT(*) with PK Index

    alter table demo.foo add primary key (id);
    
    analyze demo.foo;
    
    explain analyze select count(*) from demo.foo;
    
    Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
      Rows out:  1 rows with 511 ms to end, start offset by 0.459 ms.
      ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
            Rows out:  2 rows at destination with 511 ms to end, start offset by 0.471 ms.
            ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
                  Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 510 ms to end, start offset by 0.836 ms.
                  ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=0)
                        Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.039 ms to first row, 420 ms to end, start offset by 0.837 ms.
    Slice statistics:
      (slice0)    Executor memory: 152K bytes.
      (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
    Statement statistics:
      Memory used: 128000K bytes
    Total runtime: 511.958 ms

    COUNT(1) with PK Index

    explain analyze select count(1) from demo.foo;
    
    Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
      Rows out:  1 rows with 511 ms to end, start offset by 0.563 ms.
      ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
            Rows out:  2 rows at destination with 507 ms to first row, 511 ms to end, start offset by 0.596 ms.
            ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
                  Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 507 ms to end, start offset by 1.022 ms.
                  ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=0)
                        Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.040 ms to first row, 404 ms to end, start offset by 1.023 ms.
    Slice statistics:
      (slice0)    Executor memory: 152K bytes.
      (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
    Statement statistics:
      Memory used: 128000K bytes
    Total runtime: 511.875 ms

    Conclusion: No difference.  The cost went down slightly with an Index but COUNT(*) versus COUNT(1) didn’t make a difference.

    3. COUNT(column name) is faster when you count a unique column

    COUNT(ID)

    explain analyze select count(id) from demo.foo;
    
    Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
      Rows out:  1 rows with 533 ms to end, start offset by 0.609 ms.
      ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
            Rows out:  2 rows at destination with 531 ms to first row, 533 ms to end, start offset by 0.647 ms.
            ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
                  Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 529 ms to end, start offset by 4.696 ms.
                  ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=4)
                        Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.030 ms to first row, 418 ms to end, start offset by 4.697 ms.
    Slice statistics:
      (slice0)    Executor memory: 152K bytes.
      (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
    Statement statistics:
      Memory used: 128000K bytes
    Total runtime: 533.647 ms

    Conclusion: No difference

    4. COUNT(1) is faster when the table is column oriented

    Setup

    create table demo.foo_column
    (id int not null,
     fname text,
     lname text)
     WITH (appendonly=true, orientation=column)
     distributed by (id);
    
    insert into demo.foo_column
    select i, 'jon' || i, 'roberts' || i
    from (select generate_series(1, 1000000) as i) as sub;
    
    analyze demo.foo_column;

    COUNT(*)

    explain analyze select count(*) from demo.foo_column;
    Aggregate  (cost=13350.06..13350.07 rows=1 width=8)
      Rows out:  1 rows with 368 ms to end, start offset by 0.544 ms.
      ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=13350.00..13350.04 rows=1 width=8)
            Rows out:  2 rows at destination with 362 ms to first row, 368 ms to end, start offset by 0.573 ms.
            ->  Aggregate  (cost=13350.00..13350.01 rows=1 width=8)
                  Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 359 ms to end, start offset by 4.160 ms.
                  ->  Append-only Columnar Scan on foo_column  (cost=0.00..10850.00 rows=500000 width=0)
                        Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.182 ms to first row, 266 ms to end, start offset by 4.160 ms.
    Slice statistics:
      (slice0)    Executor memory: 152K bytes.
      (slice1)    Executor memory: 264K bytes avg x 2 workers, 264K bytes max (seg0).
    Statement statistics:
      Memory used: 128000K bytes
    Total runtime: 368.748 ms

    COUNT(1)

    explain analyze select count(1) from demo.foo_column;
    
    Aggregate  (cost=13350.06..13350.07 rows=1 width=8)
      Rows out:  1 rows with 277 ms to end, start offset by 0.557 ms.
      ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=13350.00..13350.04 rows=1 width=8)
            Rows out:  2 rows at destination with 204 ms to first row, 277 ms to end, start offset by 0.586 ms.
            ->  Aggregate  (cost=13350.00..13350.01 rows=1 width=8)
                  Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 273 ms to end, start offset by 4.296 ms.
                  ->  Append-only Columnar Scan on foo_column  (cost=0.00..10850.00 rows=500000 width=0)
                        Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.195 ms to first row, 174 ms to end, start offset by 4.297 ms.
    Slice statistics:
      (slice0)    Executor memory: 152K bytes.
      (slice1)    Executor memory: 264K bytes avg x 2 workers, 264K bytes max (seg0).
    Statement statistics:
      Memory used: 128000K bytes
    Total runtime: 277.571 ms

    Conclusion: No difference. The cost is lower to count the number of rows in a column oriented table but again, makes no difference if it is COUNT(1) or COUNT(*).

    Summary

    Don’t use silly code tricks like COUNT(1) because it doesn’t make any difference.

    Chorus

    I recently dug into Greenplum Chorus to better understand the collaboration tool.  I installed it on my Mac with my Greenplum Database, Greenplum Hadoop, and Greenplum Command Center.  In this post, I will review how Chorus was designed and how it facilitates collaboration in an organization.
    (Image: Greenplum Chorus with Greenplum Database and Hadoop)

    Chorus Features
    First off, the image above shows Greenplum Chorus, Database, and Hadoop. I created the image to better understand the product so this isn’t a marketing slide given to me.

    Starting with Chorus, you have these main features.

    • Scheduler: Used to refresh data in Sandbox (more on this below)
    • Data Definitions: Connections to Greenplum databases and Hadoop. You can browse the data in both databases and Hadoop and then link this to Workspaces.
    • Insights: Comments made in Chorus that are so important to the business that they are shared with everyone! No longer will revelations about data be isolated to a single person or a small group. Now you can share your insights with others!

    Workspace Features

    Next we have Workspaces.  In my image, I have a “Sales” Workspace as well as other Workspaces like HR, Marketing, Analytics, and Accounting.  This is where people work together around data and as you can see, the Chorus Administrator can create as many Workspaces as needed for your organization.  Each Workspace has these features:

    • Linked Data: Data from Greenplum databases and Hadoop that is linked to the workspace. This makes it simple to query this data in Chorus without needing to make a copy of the data.
    • Sandbox Data: This is either data copied from Greenplum databases using the scheduler or completely new tables derived using SQL you write. This is very powerful and goes a long way to providing business users the ability to find new value in data.
    • Chorus View: This is similar to a database view but the definition resides in Chorus.
    • Files: This is basically metadata (Text, Images, other) plus code (SQL) that is stored in the workspace. Versioning is done automatically too. You can execute the SQL directly in Chorus. Very powerful stuff.
    • Notes: This is more metadata about anything in your workspace.  Notes are also what can be promoted to an Insight for the entire organization.  You are building a living and breathing data dictionary with Chorus!

    Visualization

    Workspaces also have the ability to visualize data. This is done with graphing similar to a BI tool. The visualization is there to quickly understand the data and then take action on it. Maybe the action is to write a Note or Insight, or it might be to investigate further with additional queries to Greenplum database and Hadoop.  Chorus isn’t meant to replace reporting tools.  Instead, the aim is to quickly understand the data and then take action with a Note, Insight, and/or more investigation.

    Security

    Logging into Chorus can be handled by LDAP/Active Directory if you like.  Hadoop and Database connections can be made public to the Chorus users or you can require users to log into the data sources so security is handled by the Database and Hadoop.

    Summary

    Chorus is a great collaboration tool for Greenplum.  I am very impressed with the tool and expect more great things from the product in the future.

    Creating a Date Dimension

    Here is another technique worth mentioning in regard to building a date dimension.  One way to do this is to loop over the dates and insert one record at a time.  That is slow in virtually every database, but there is a way to do it with a single INSERT statement.

    First create the table:

    DROP TABLE if exists hr.date_dim;
    
    CREATE TABLE hr.date_dim
    (
      date_key timestamp without time zone NOT NULL,
      date_week timestamp without time zone NOT NULL,
      date_month timestamp without time zone NOT NULL,
      date_quarter timestamp without time zone NOT NULL,
      date_year timestamp without time zone NOT NULL
    )
    DISTRIBUTED BY (date_key);

    And now the magic:

    INSERT INTO hr.date_dim
    SELECT  mydate AS date_key,
            date_trunc('week', mydate) AS date_week,
            date_trunc('month', mydate) AS date_month,
            date_trunc('quarter', mydate) AS date_quarter,
            date_trunc('year', mydate) AS date_year
    FROM    (
            SELECT '1995-01-01'::timestamp + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM '2007-04-16'::timestamp - '1995-01-01'::timestamp)::int))) AS mydate
            ) AS sub;

    My example populates the date dimension with every day from January 1, 1995 through April 16, 2007.  I chose this range because those are the minimum and maximum dates in my HR table with job history information.
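
    If you want to verify the load, a quick sanity check is to look at the row count and date range:

    SELECT count(*)      AS days,
           min(date_key) AS first_day,
           max(date_key) AS last_day
    FROM   hr.date_dim;

    The count should equal the number of days between the two endpoints, inclusive.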

    This solution is similar to my other post about using generate_series to avoid a nested loop.  generate_series is a very powerful and easy way to dynamically create a dataset without writing a loop.

    Nested Loops, BETWEEN, and generate_series

    Here is a common scenario in a data warehouse. You have designed yourself a date dimension and you have another table with a begin and end time for some activity.

    The date dimension might look something like this:

    CREATE TABLE hr.date_dim
    (
    date_key timestamp without time zone NOT NULL,
    date_week timestamp without time zone NOT NULL,
    date_month timestamp without time zone NOT NULL,
    date_quarter timestamp without time zone NOT NULL,
    date_year timestamp without time zone NOT NULL
    )
    DISTRIBUTED BY (date_key);

    And your activity table would look something like this:

    CREATE TABLE hr.job_history
    (
    employee_id numeric NOT NULL,
    start_date timestamp without time zone NOT NULL,
    end_date timestamp without time zone,
    job_id character varying(10),
    department_id numeric,
    CONSTRAINT job_history_pkey PRIMARY KEY (employee_id, start_date)
    )
    DISTRIBUTED BY (employee_id, start_date);

    One way to write the query joining the two tables is with BETWEEN. Here is what it looks like:

    SELECT *
    FROM hr.job_history jh
    JOIN hr.date_dim d ON d.date_key BETWEEN jh.start_date AND jh.end_date;
    

    The explain plan looks like this on my local install on my Mac:

    Gather Motion 2:1  (slice2; segments: 2)  (cost=2.42..2301.81 rows=2494 width=78)
      ->  Nested Loop  (cost=2.42..2301.81 rows=2494 width=78)
            Join Filter: d.date_key >= jh.start_date AND d.date_key <= jh.end_date
            ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
            ->  Materialize  (cost=2.42..2.62 rows=10 width=38)
                  ->  Broadcast Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.40 rows=10 width=38)
                        ->  Seq Scan on job_history jh  (cost=0.00..2.10 rows=5 width=38)

    Notice it is doing a Nested Loop operation. Yuck.

    And here is the one that utilizes generate_series:

    SELECT *
    FROM (
    SELECT *, start_date + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM end_date - start_date)::int))) AS mydate
    FROM hr.job_history
    ) AS jh
    JOIN hr.date_dim d ON jh.mydate = d.date_key;

    Look at the explain plan now!

    Gather Motion 2:1  (slice2; segments: 2)  (cost=2.70..68.96 rows=7 width=166)
      ->  Hash Join  (cost=2.70..68.96 rows=7 width=166)
            Hash Cond: d.date_key = jh.mydate
            ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
            ->  Hash  (cost=2.58..2.58 rows=5 width=126)
                  ->  Redistribute Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.58 rows=5 width=126)
                        Hash Key: jh.mydate
                        ->  Result  (cost=0.00..2.27 rows=5 width=38)
                              ->  Seq Scan on job_history  (cost=0.00..2.27 rows=5 width=38)

    It is doing a Hash Join and the cost has dropped significantly.

    The lesson here is to avoid Nested Loops in Greenplum, and one way to do that is to use an equi-join instead of BETWEEN.  As demonstrated here, the generate_series function in Greenplum makes that rewrite possible.
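
    One wrinkle worth noting: end_date is nullable in hr.job_history (for employees still in a role), and the series needs a concrete upper bound.  Here is a minimal sketch of a workaround, assuming CURRENT_DATE is an acceptable stand-in for open-ended rows:

    SELECT *
    FROM (
        SELECT *,
               -- CURRENT_DATE is an assumed cutoff for rows with no end_date
               start_date + interval '1 day' *
                   generate_series(0, EXTRACT('days' FROM COALESCE(end_date, CURRENT_DATE::timestamp) - start_date)::int) AS mydate
        FROM hr.job_history
    ) AS jh
    JOIN hr.date_dim d ON jh.mydate = d.date_key;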

    Dynamic External Tables

    A customer encouraged me to write a post about how it is possible to use shell scripts to dynamically alter the results from an External Web Table.

    For this example, I will create a table in both Greenplum and SQL Server that will contain a list of files.  I want to dynamically get only the records from the External Web Table that I need, rather than retrieving all of the rows from SQL Server or recreating the External Web Table.  Note: Outsourcer has “Append” jobs for getting just the new records, but it has been designed for integer data types, not character data like in this example.

    Greenplum Table:

    CREATE TABLE myfiles
    (filename TEXT NOT NULL PRIMARY KEY,
    dir TEXT NOT NULL)
    DISTRIBUTED BY (filename);
    
    INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');

    SQL Server Table:

    CREATE TABLE myfiles
    (filename VARCHAR(500) NOT NULL PRIMARY KEY,
     dir VARCHAR(MAX) NOT NULL);
    
    INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
    --Note that these three rows don't exist in Greenplum
    INSERT INTO myfiles VALUES ('2012-01-06_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-07_10-30-00_data.txt', '/data/dump');
    INSERT INTO myfiles VALUES ('2012-01-08_10-30-00_data.txt', '/data/dump');

    Shell Script: /Users/gpadmin/get_myfiles.sh
    Please note that the use of Outsourcer.jar requires creating a connection in os.ext_connection. Review the Outsourcer documentation on creating External Tables for more information.

    max=`psql -A -t -c "SELECT MAX(filename) FROM myfiles"`
    java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename > '$max'"

    Dynamic External Web Table in Greenplum

    CREATE EXTERNAL WEB TABLE ext_myfiles
    (filename TEXT,
     dir TEXT)
    EXECUTE E'/Users/gpadmin/get_myfiles.sh' ON MASTER
    FORMAT 'text' (delimiter '|' null 'null' escape E'\\')
    ENCODING 'UTF8';

    Insert statement in Greenplum

    INSERT INTO myfiles SELECT * FROM ext_myfiles;

    This technique will only retrieve the rows from SQL Server that need to be inserted. When dealing with millions of records, this will greatly improve performance.

    You can manipulate your script to retrieve data based on anything you want all without having to recreate your External Web Table.
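
    One more safeguard worth considering: if the script could ever return a filename that is already in Greenplum (say, after a partial failure and a re-run), an anti-join keeps the insert idempotent.  This is just a sketch using the tables above:

    INSERT INTO myfiles
    SELECT e.filename, e.dir
    FROM ext_myfiles e
    LEFT JOIN myfiles m ON m.filename = e.filename
    WHERE m.filename IS NULL;  -- only insert filenames not already loaded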

    Killing Idle Sessions

    A lot of users will open a connection to a database and then leave it open for days.  They will forget about this and then open more connections.  Eventually, they will hit the max_connections limit and then not be able to create any new connections to Greenplum.

    An easy fix for this is to kill sessions programmatically that have been idle for more than an hour.  Of course, you can adjust this time period as appropriate.

    Here is a simple SQL statement to identify the idle sessions:

    SELECT procpid
    FROM pg_stat_activity
    WHERE current_query = '<IDLE>'
    AND clock_timestamp() - query_start > interval '1 hour';

    One way to kill these idle sessions is with a single command:

    eval `psql -A -t -c "SELECT 'kill ' || procpid FROM pg_stat_activity WHERE current_query = '<IDLE>' AND clock_timestamp() - query_start > interval '1 hour'"`

    Put that command into a shell script (be sure to source the greenplum_path.sh script first) and create a cron job that runs every 5 minutes.

    If you want to track who you killed and what time, you can use this shell script and again, use a cron job to execute it every 5 minutes.

    #!/bin/bash
    source /usr/local/greenplum-db/greenplum_path.sh
    for i in $( psql -A -t -c "SELECT procpid, usename FROM pg_stat_activity WHERE current_query = '<IDLE>' AND clock_timestamp() - query_start > interval '1 hour'" ); do
            procpid=$(echo $i | cut -d\| -f1)
            usename=$(echo $i | cut -d\| -f2)
    
            echo `date`\|$procpid\|$usename >> kill.log
            kill $procpid
    done
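
    If your Greenplum version includes the pg_terminate_backend() function (check your release’s documentation), the same cleanup can be sketched entirely in SQL instead of shelling out to kill:

    -- assumes pg_terminate_backend() exists in your build
    SELECT pg_terminate_backend(procpid)
    FROM pg_stat_activity
    WHERE current_query = '<IDLE>'
    AND clock_timestamp() - query_start > interval '1 hour';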

    Rolling partitions with Polymorphic Storage

    Greenplum’s Polymorphic Storage is a very useful feature of the Greenplum database where tables can be created and altered with a mixture of columnar, compressed, and row-oriented storage.

    Polymorphic Storage

    Here is an example of creating one of these tables and then managing this over time.

    CREATE TABLE edw.sales_fact
    (
      store_id integer,
      sales_person_id integer,
      product_id integer,
      sale_date date,
      sale_price numeric,
      product_cost numeric,
      spiff numeric
    )
    DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date)
    PARTITION BY RANGE(sale_date) 
            (
            START ('2011-01-01'::timestamp) INCLUSIVE END ('2011-02-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
            START ('2011-02-01'::timestamp) INCLUSIVE END ('2011-03-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=quicklz),
            START ('2011-03-01'::timestamp) INCLUSIVE END ('2011-04-01'::timestamp) EXCLUSIVE
            )
    ;

    As you can see, this table has three partitions with each having different storage techniques.

    • January 2011 is columnar and compressed with ZLib
    • February 2011 is columnar and compressed with QuickLZ
    • March 2011 is stored as a row store and is not compressed

    As time goes on, the DBA may want to make March 2011 compressed and then add a partition for April 2011. To do this, we first need to move the March 2011 data to a new table and exchange the partition. The last step is to add the new April 2011 partition.

    CREATE TABLE edw.sales_fact_march
    (
      store_id integer,
      sales_person_id integer,
      product_id integer,
      sale_date date,
      sale_price numeric,
      product_cost numeric,
      spiff numeric
    )
    WITH (appendonly=true, orientation=column, compresstype=quicklz)
    DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date);
    
    INSERT INTO edw.sales_fact_march 
    SELECT * FROM edw.sales_fact 
    WHERE sale_date >= '2011-03-01' 
    AND sale_date < '2011-04-01';
    
    ANALYZE edw.sales_fact_march;

    This next step is optional, but you can now TRUNCATE the March 2011 partition because all of the data is in the sales_fact_march table.  We are about to exchange the two, and the exchange is faster with a truncated partition.

    TRUNCATE TABLE edw.sales_fact_1_prt_3;

    Note: I got the partition table name from pg_partitions.

    Now exchange the row-stored partition for the columnar, compressed table.

    ALTER TABLE edw.sales_fact 
    EXCHANGE PARTITION FOR (RANK(3)) 
    WITH TABLE edw.sales_fact_march 
    WITH VALIDATION;

    Note: I got the RANK from pg_partitions.
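
    Both the partition table name and the RANK used above come from the pg_partitions catalog view; a query along these lines lists them for the table:

    SELECT partitiontablename, partitionrank, partitionrangestart, partitionrangeend
    FROM pg_partitions
    WHERE schemaname = 'edw'
    AND tablename = 'sales_fact'
    ORDER BY partitionrank;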

    We no longer need the temporary March 2011 data so we can drop this table.

    DROP TABLE edw.sales_fact_march;

    Lastly, add an empty April 2011 partition.

    ALTER TABLE edw.sales_fact 
    ADD PARTITION START ('2011-04-01'::timestamp) INCLUSIVE 
    END ('2011-05-01'::timestamp) EXCLUSIVE;

    The new DDL for the table looks like this:

    CREATE TABLE edw.sales_fact
    (
      store_id integer,
      sales_person_id integer,
      product_id integer,
      sale_date date,
      sale_price numeric,
      product_cost numeric,
      spiff numeric
    )
    WITH (
      OIDS=FALSE
    )
    DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date)
    PARTITION BY RANGE(sale_date) 
              (
              START ('2011-01-01'::date) END ('2011-02-01'::date) WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5) , 
              START ('2011-02-01'::date) END ('2011-03-01'::date) WITH (appendonly=true, orientation=column, compresstype=quicklz), 
              START ('2011-03-01'::date) END ('2011-04-01'::date) WITH (appendonly=true, orientation=column, compresstype=quicklz), 
              START ('2011-04-01'::date) END ('2011-05-01'::date)
              )
    ;

    Loading data with External Tables and gpfdist

    Straight from the Greenplum Admin Guide:

    gpfdist is Greenplum’s parallel file distribution program. It is used by readable external tables and gpload to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file.

    In this post, I’m going to demonstrate how to use a readable external table to load data that uses gpfdist. This makes it easy for Greenplum users to load massive amounts of data by just writing SQL!

    First, start gpfdist as a background process.

    gpfdist -d /Users/gpadmin -p 8080 &

    Create my Table in Greenplum.

    CREATE TABLE foo
    (bar_id int not null,
     bar text,
     bar_description text)
    DISTRIBUTED BY (bar_id);

    Now create my External Table.

    CREATE EXTERNAL TABLE ext_foo
    (LIKE foo) LOCATION ('gpfdist://localhost:8080/demo/foo.txt')
    FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null');

    • I could have spelled out all of the columns, but I used a shortcut with “LIKE foo”
    • The location indicates it is using gpfdist, with the host localhost and port 8080
    • My file is located at /Users/gpadmin/demo/foo.txt, but gpfdist is serving /Users/gpadmin, so I only need to specify the demo subdirectory and the file name
    • You can pick TEXT or CSV format, but TEXT will execute faster
    • I used pipes as my delimiter and spelled out null for my null values

    Insert the data.

    INSERT INTO foo SELECT * FROM ext_foo;

    You are done!

    Other tips:

    • When creating the External Table, you can also specify an error_table (created dynamically if not found) with a reject limit.
    LOG ERRORS INTO error_table SEGMENT REJECT LIMIT count ROWS
    • If you have a really large file to load, you can split the file into multiple files and then specify multiple locations in the External Table.  You can even specify different hosts for each file with multiple gpfdists running at once (see the sketch below).
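
    Combining both tips, here is a sketch of an External Table that reads two file pieces from two gpfdist instances and logs bad rows instead of failing the load.  The host names etl1 and etl2, the file names, and the error table err_foo are hypothetical:

    CREATE EXTERNAL TABLE ext_foo_multi
    (LIKE foo)
    LOCATION ('gpfdist://etl1:8080/demo/foo_part1.txt',   -- etl1/etl2 are example hosts
              'gpfdist://etl2:8080/demo/foo_part2.txt')
    FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null')
    LOG ERRORS INTO err_foo SEGMENT REJECT LIMIT 100 ROWS;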

    And remember that inserting data this way pushes the data directly to the segments.  It bypasses the master server, so it scales much better than other databases’ loading utilities.  Greenplum can load 10TB/hour on a full rack DCA when using gpfdist!