Platform Extension Framework (PXF) – Azure Blob Store

Greenplum can read and write data in Azure Blob Storage with PXF, just as it can with AWS S3. Here is a quick demo of using Azure.

Step 1 – Start PXF

Run this as gpadmin on the Master (mdw) host.

pxf cluster start
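
If you want to confirm that PXF is running on every host before moving on, the PXF CLI includes a status command:

pxf cluster status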

Step 2 – Create Azure Storage Account

Log into Azure and navigate to Storage Accounts and create a new one.

Create New Storage Account

The Storage Account name must be globally unique within Azure. I picked “greenplumguru” for this demo, but yours will need to be different.
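
If you prefer scripting this over clicking through the portal, the Azure CLI can create the Storage Account as well. This is just a sketch; the resource group and region below are placeholders you would swap for your own:

az storage account create \
  --name greenplumguru \
  --resource-group my-resource-group \
  --location eastus \
  --sku Standard_LRS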

Step 3 – Retrieve the Storage Account Key

After the Storage Account has been created, go to the resource, click on “Access Keys”, and copy one of the keys.

Retrieve the “key”.
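
The key can also be pulled with the Azure CLI; a minimal sketch, assuming the same placeholder resource group as above:

az storage account keys list \
  --account-name greenplumguru \
  --resource-group my-resource-group \
  --query "[0].value" -o tsv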

Step 4 – Configure PXF

On the master node (mdw), execute the following commands.

mkdir $PXF_CONF/servers/demo
cp $PXF_CONF/templates/wasbs-site.xml $PXF_CONF/servers/demo/
vi $PXF_CONF/servers/demo/wasbs-site.xml

Change YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME to the name of your Storage Account from Step 2. Change YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY to the key value from Step 3.
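
For reference, here is a sketch of what the finished file might look like if you write it in one shot instead of editing with vi. The property name comes from the shipped template, and the account name and key value below are placeholders:

cat > $PXF_CONF/servers/demo/wasbs-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.azure.account.key.greenplumguru.blob.core.windows.net</name>
        <value>PASTE_YOUR_STORAGE_ACCOUNT_KEY_HERE</value>
    </property>
</configuration>
EOF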

Step 5 – Sync PXF

pxf cluster sync

Step 6 – Create a Container in your Storage Account

Click on Containers in your Storage Account and create a new container. I named mine “demo-container”.
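
If you would rather script this step, the Azure CLI can create the container too (the account key here is a placeholder for the value from Step 3):

az storage container create \
  --name demo-container \
  --account-name greenplumguru \
  --account-key "PASTE_YOUR_STORAGE_ACCOUNT_KEY_HERE"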

Step 7 – Create Writable External Table

CREATE WRITABLE EXTERNAL TABLE ext_write
(id int, description text)
LOCATION ('pxf://demo-container@greenplumguru.blob.core.windows.net/path/to/blob/file?PROFILE=wasbs:text&SERVER=demo')
FORMAT 'csv';

In the above example, I used the following:

  • “greenplumguru” for the Storage Account from step 2
  • “demo” as the name of the server configuration from step 4
  • “demo-container” which is the container name from step 6

Step 8 – Insert Some Data

INSERT INTO ext_write SELECT i, 'foo_' || i 
FROM generate_series(1,100000) as i;
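
To verify that PXF wrote files into the container, you can list the blobs with the Azure CLI (same placeholder key as before):

az storage blob list \
  --container-name demo-container \
  --account-name greenplumguru \
  --account-key "PASTE_YOUR_STORAGE_ACCOUNT_KEY_HERE" \
  --output table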

Step 9 – Create Readable External Table

CREATE EXTERNAL TABLE ext_read
(id int, description text)
LOCATION ('pxf://demo-container@greenplumguru.blob.core.windows.net/path/to/blob/file?PROFILE=wasbs:text&SERVER=demo')
FORMAT 'csv';

Step 10 – Select from External Table

SELECT * FROM ext_read LIMIT 10;

Platform Extension Framework (PXF) – S3

Greenplum has always used External Tables to query external data. You can load this data into Greenplum-formatted tables or even use it as partitions in a partitioned table. The “location” of the External Table can be many things, such as gpfdist (for files), S3, or PXF.

PXF can access many external systems such as HDFS, HBase, Hive, JDBC sources, Google Cloud Storage, Minio, and AWS S3. This article covers S3, but you can learn more about PXF in the documentation.

When you deploy Greenplum in AWS, an IAM::Role resource is created that enables many AWS features on the deployed virtual machines. You see this warning message about the Role when you deploy.

IAM::Role acknowledgment

Because you have the Role by default in AWS, you can leverage these credentials immediately to access S3 via PXF. Here is how you do it.

Step 1 – Start PXF

Run this as gpadmin on the Master (mdw) host.

pxf cluster start

Step 2 – Create S3 Bucket

Note: The bucket name must be globally unique across AWS.
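
You can create the bucket in the AWS console or with the AWS CLI; a one-line sketch using the bucket name from the examples below:

aws s3 mb s3://my-demo-bucket-greenplum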

Step 3 – Create Writable External Table

CREATE WRITABLE EXTERNAL TABLE ext_write
(id int, description text)
LOCATION ('pxf://my-demo-bucket-greenplum/?PROFILE=s3:text&SERVER=s3')
FORMAT 'csv';

Step 4 – Insert Some Data

INSERT INTO ext_write SELECT i, 'foo_' || i 
FROM generate_series(1,100000) as i;
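
As with the Azure example, you can confirm the write by listing the objects PXF created in the bucket:

aws s3 ls s3://my-demo-bucket-greenplum/ --recursive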

Step 5 – Create Readable External Table

CREATE EXTERNAL TABLE ext_read
(id int, description text)
LOCATION ('pxf://my-demo-bucket-greenplum/?PROFILE=s3:text&SERVER=s3')
FORMAT 'csv';

Step 6 – Select from External Table

SELECT * FROM ext_read LIMIT 10;

As you can see, with relative ease, you can create readable and writable external tables using PXF in AWS by leveraging the existing credentials that are included with your Greenplum on AWS deployment. Enjoy!

Greenplum Scaling in the Cloud

Scaling On-Premise

MPP databases like Teradata, Redshift, and Greenplum have traditionally “scaled” by adding more physical nodes. The shared-nothing architecture is great in that it delivers near-linear scalability: if you double the number of machines, you get nearly double the performance.

For Greenplum, adding more nodes is done with the help of a tool called gpexpand, which redistributes data to the newly added nodes. The database essentially has to rewrite its tables to spread the data across all of the nodes in the cluster. This is a time-consuming process, but the database can stay online while the expansion occurs.
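
The gpexpand workflow looks roughly like this; the input file name is hypothetical, and the exact flags should be checked against the documentation for your Greenplum version:

# Phase 1: initialize the new hosts/segments described in an input file
gpexpand -i new_segments_input_file

# Phase 2: redistribute existing tables across the enlarged cluster
gpexpand

# Cleanup: drop the expansion schema once redistribution is complete
gpexpand -c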

How long does it take? Well, you first have to buy more physical nodes, rack them, and run power and network. After all of that, you can finally run gpexpand, which itself executes fairly quickly.

Scaling in the Cloud Option 1

With the cloud, provisioning a new VM is quick and easy, so you can add more nodes and then run gpexpand. However, you still have to reshuffle the data, which impacts the database while it is running, and it doesn’t give you a rollback plan.

The easiest and safest option for adding more nodes in the cloud is to create a new, larger cluster and use gpcopy to move the data from the old cluster to the new one. This gives you the ability to validate that the larger cluster performs as expected before you switch over, and it gives you a rollback plan: you can always go back to the old cluster. Once you are happy with the new cluster, you can delete the old one.
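
A rough sketch of that migration with gpcopy; the hostnames are hypothetical, and you should confirm the flags against the gpcopy documentation for your version:

# Copy all databases from the old cluster's master to the new cluster's master
gpcopy --full \
  --source-host mdw-old --source-port 5432 \
  --dest-host mdw-new --dest-port 5432 \
  --drop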

How long does it take? In the cloud, it takes no more than an hour to create a new cluster and transfer rates with gpcopy have been observed at 5TB to 10TB per hour.

Scaling in the Cloud Option 2

The other option in the cloud is available via the AWS, Azure, and GCP Marketplaces with a tool called gpcompute. This tool alters the instance type the cluster uses for the Segment Hosts to either increase or decrease the compute power. This executes in just a few minutes too.

The additional compute gives you the ability to handle higher concurrency workloads. You can then decrease the compute to save on the IaaS costs.

With gpcompute, you can better manage your IaaS costs while dynamically handling the demands on your database.

Cloud Provider   Bring Your Own License   Billed Hourly
AWS              BYOL                     Hourly
GCP              BYOL                     Hourly
Azure            BYOL                     Hourly