Platform Extension Framework (PXF) – S3

Greenplum has long used External Tables to query external data. You can load that data into Greenplum-managed tables or even use external tables as partitions of a partitioned table. The “location” of an External Table can point to several protocols, such as gpfdist (for files), s3, or pxf.
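As a quick illustration, a readable external table served by gpfdist might look like this (the table name, host, port, and file path are hypothetical):

```sql
-- Hypothetical example: a readable external table backed by gpfdist.
-- Assumes gpfdist is serving a directory of CSV files on host etlhost.
CREATE EXTERNAL TABLE ext_expenses
    (name text, expense_date date, amount float4)
LOCATION ('gpfdist://etlhost:8081/expenses/*.csv')
FORMAT 'csv';
```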

PXF can access many external data sources, including HDFS, HBase, Hive, JDBC databases, Google Cloud Storage, MinIO, and AWS S3. This article covers S3; you can learn more about PXF in the Greenplum documentation.

When you deploy Greenplum in AWS, the deployment creates an IAM::Role resource that grants the virtual machines access to many AWS services. You see this acknowledgment about the Role when you deploy:

IAM::Role acknowledgment

Because the Role is attached to the instances by default, PXF can use its credentials immediately to access S3 — no access keys required. Here is how you do it.
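The SERVER=s3 setting used in the examples below refers to a PXF server configuration directory on the Master host. When the instances carry an IAM role, that configuration can be minimal; a sketch of what $PXF_CONF/servers/s3/s3-site.xml might contain is shown below. The property names follow the Hadoop s3a convention — treat this as an assumption about your deployment, not the exact file it ships with:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Assumption: no fs.s3a.access.key / fs.s3a.secret.key entries,
         so credentials are supplied by the instance's IAM role. -->
</configuration>
```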

Step 1 – Start PXF

Run this as gpadmin on the Master (mdw) host.

pxf cluster start

Step 2 – Create S3 Bucket

Note: S3 bucket names must be globally unique. The examples below use my-demo-bucket-greenplum; substitute your own bucket name.
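One way to create the bucket is with the AWS CLI, assuming it is installed and the instance role permits bucket creation (the bucket name and region here are examples):

```shell
# Example only: bucket names are global, so pick your own.
aws s3 mb s3://my-demo-bucket-greenplum --region us-east-1
```

You can also create the bucket from the AWS console.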

Step 3 – Create Writable External Table

CREATE WRITABLE EXTERNAL TABLE ext_write
(id int, description text)
LOCATION ('pxf://my-demo-bucket-greenplum/?PROFILE=s3:text&SERVER=s3')
FORMAT 'csv';

Step 4 – Insert Some Data

INSERT INTO ext_write SELECT i, 'foo_' || i 
FROM generate_series(1,100000) as i;

Step 5 – Create Readable External Table

CREATE EXTERNAL TABLE ext_read
(id int, description text)
LOCATION ('pxf://my-demo-bucket-greenplum/?PROFILE=s3:text&SERVER=s3')
FORMAT 'csv';

Step 6 – Select from External Table

SELECT * FROM ext_read LIMIT 10;
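To sanity-check the round trip, you can count the rows that come back through the readable table — the total should match the 100,000 rows inserted above, assuming no other files have landed in the bucket:

```sql
-- Should return 100000 if only the rows inserted above were written.
SELECT count(*) FROM ext_read;
```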

As you can see, you can create readable and writable external tables with PXF on AWS with relative ease, leveraging the credentials included with your Greenplum on AWS deployment. Enjoy!
