Greenplum Scaling in the Cloud

Scaling On-Premises

MPP databases such as Teradata, Redshift, and Greenplum have traditionally “scaled” by adding more physical nodes. The shared-nothing architecture delivers near-linear scalability: if you double the number of machines, you get nearly double the performance.
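To make the near-linear claim concrete, here is a rough sketch of how a query runtime might change when nodes are added. The function name and the `efficiency` factor (1.0 means perfectly linear) are assumptions for illustration, not measured Greenplum figures, and the model only covers scaling up.

```python
def estimated_runtime(baseline_seconds, old_nodes, new_nodes, efficiency=0.9):
    """Estimate runtime after adding nodes under a near-linear scaling model.

    efficiency is a hypothetical factor: the fraction of the ideal
    extra speedup that is actually achieved (1.0 = perfectly linear).
    """
    if old_nodes <= 0 or new_nodes <= 0:
        raise ValueError("node counts must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    ideal_speedup = new_nodes / old_nodes
    # Only the speedup gained beyond 1x is discounted by the efficiency factor.
    speedup = 1 + (ideal_speedup - 1) * efficiency
    return baseline_seconds / speedup


# A 600-second query on 8 nodes, after doubling to 16 nodes.
print(estimated_runtime(600, 8, 16, efficiency=1.0))  # perfectly linear: 300.0
print(round(estimated_runtime(600, 8, 16), 1))        # assumed 90% efficiency
```

Real results depend on data distribution and workload, so treat the output as a planning estimate only.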

For Greenplum, adding nodes is done with a tool called gpexpand, which redistributes the data to the newly added nodes. The database essentially has to recreate its tables to spread the data across all of the nodes in the cluster. This is a time-consuming process, but the database can stay up while the expansion runs.

How long does it take? You first have to buy more physical nodes, then rack them and run power and network. Only after all of that can you run gpexpand, which executes quickly compared with the procurement and installation work.

Scaling in the Cloud Option 1

With the cloud, provisioning a new VM is quick and easy, so you can add more nodes and then run gpexpand. However, you still have to reshuffle the data, which impacts the database while it is running, and this approach does not give you a rollback plan.

The easiest and safest option to add more nodes in the cloud is to create a new cluster and use gpcopy to move the data from the old cluster to the new one. This gives you the ability to validate that the larger cluster performs as expected before you switch over. It also gives you a rollback plan: you can go back to the old cluster. Once you are happy with the new cluster, you can delete the old one.

How long does it take? In the cloud, it takes no more than an hour to create a new cluster, and transfer rates with gpcopy have been observed at 5 TB to 10 TB per hour.
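As a back-of-the-envelope illustration, the figures above can be turned into a rough migration estimate. This is a minimal sketch: the function name and the 50 TB database size are hypothetical, and it assumes the full hour for cluster creation plus a constant transfer rate.

```python
def estimate_migration_hours(data_tb, rate_tb_per_hour, provision_hours=1.0):
    """Rough wall-clock estimate: cluster creation plus gpcopy transfer time."""
    if data_tb < 0 or provision_hours < 0:
        raise ValueError("sizes and durations must be non-negative")
    if rate_tb_per_hour <= 0:
        raise ValueError("transfer rate must be positive")
    return provision_hours + data_tb / rate_tb_per_hour


data_tb = 50  # hypothetical database size in TB
slowest = estimate_migration_hours(data_tb, 5)   # observed lower rate
fastest = estimate_migration_hours(data_tb, 10)  # observed upper rate
print(f"{data_tb} TB: roughly {fastest:.0f} to {slowest:.0f} hours")
# prints "50 TB: roughly 6 to 11 hours"
```

The estimate ignores validation time on the new cluster, which you would add before switching over.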

Scaling in the Cloud Option 2

The other option in the cloud is available via the AWS, Azure, and GCP Marketplaces with a tool called gpcompute. This tool changes the instance type the cluster uses for the Segment Hosts to either increase or decrease the compute power. The change takes just a few minutes.

The additional compute gives you the ability to handle higher concurrency workloads. You can then decrease the compute to save on the IaaS costs.

With gpcompute, you can better manage your IaaS costs while dynamically handling the demands on your database.
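To illustrate the cost argument, the sketch below compares running a larger instance type all day against scaling up only during peak hours. The host count, hourly rates, and peak window are made-up numbers for illustration, not actual cloud prices, and the function is not part of gpcompute.

```python
def monthly_compute_cost(hosts, base_rate, peak_rate, peak_hours_per_day, days=30):
    """Monthly Segment Host cost when the larger instance type runs only at peak.

    base_rate and peak_rate are hypothetical hourly prices per host.
    """
    if not 0 <= peak_hours_per_day <= 24:
        raise ValueError("peak_hours_per_day must be between 0 and 24")
    peak_hours = peak_hours_per_day * days
    base_hours = (24 - peak_hours_per_day) * days
    return hosts * (base_hours * base_rate + peak_hours * peak_rate)


hosts = 8          # hypothetical number of Segment Hosts
small_rate = 1.0   # hypothetical hourly price, smaller instance type
large_rate = 2.0   # hypothetical hourly price, larger instance type

always_large = monthly_compute_cost(hosts, large_rate, large_rate, 8)
scaled = monthly_compute_cost(hosts, small_rate, large_rate, 8)
print(f"always large: {always_large:.0f}, scaled at peak only: {scaled:.0f}")
# prints "always large: 11520, scaled at peak only: 7680"
```

With these assumed numbers, scaling up for eight hours a day costs about a third less than keeping the larger instance type running around the clock.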

Cloud Provider    Bring Your Own License    Billed Hourly
AWS               BYOL                      Hourly
GCP               BYOL                      Hourly
Azure             BYOL                      Hourly
