In the public cloud Marketplaces, you can scale compute and memory in your cluster independently of storage with gpcompute. This utility automates changing the instance type, which increases or decreases the amount of CPU and RAM for each VM.
By scaling up, you can handle higher concurrency. By scaling down, you can save money. So it is a great utility for handling the Monday morning traffic in the database and then scaling down for the rest of the week. Or, you may have a new cluster with few users. As adoption grows, you can scale up to handle the increase in demand.
For storage, you can scale up independently of compute with gpgrow. This utility automates the commands that increase the storage size of your cluster. For AWS and GCP, this is an online operation, which means the database stays up and running while you grow your storage. For Azure, the cluster is temporarily paused while the storage grows.
For on-premise installations, you typically scale by adding more nodes and then use the gpexpand utility. This is the most practical way to expand as you just rack more nodes and then expand the cluster to take advantage of the new nodes.
In the cloud, you can provision a new cluster so quickly that using gpcopy becomes the better way to add more nodes to your cluster.
I have a small cluster in AWS with 8 nodes (64 segment cores) and have decided to double it to 16 nodes (128 segment cores). For my test, I first loaded 1TB of TPC-H data into my old cluster. This is the Source cluster.
Next, I deployed the new AWS cluster with 16 nodes using the AWS Marketplace, which took only 17 minutes. This is the Target cluster.
gpcopy Configuration Steps
- Deploy the new cluster in a new VPC, which requires configuring VPC peering. Note: If you deploy the cluster to the same VPC, you can skip the peering configuration.
- Install gpcopy on both clusters.
- Stop Command Center on Target cluster.
- On Target cluster, update gpmon users’ passwords to match the Source cluster.
- On the Source cluster, update the gpadmin .pgpass file to have an entry for the private IP address of the Target cluster. Remember to use port 6432. Example:
- Run gpcopy on the Source cluster.
- Run analyzedb on Target cluster.
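The .pgpass step above relies on the standard PostgreSQL password file format, hostname:port:database:username:password. Here is a minimal sketch of building such an entry. The helper name pgpass_entry and the password placeholder are assumptions for illustration; the IP address and port are the ones used for the Target cluster in this post.

```shell
# Hypothetical helper: prints one .pgpass line in the standard
# hostname:port:database:username:password format ("*" matches any database).
pgpass_entry() {
  printf '%s:%s:*:%s:%s\n' "$1" "$2" "$3" "$4"
}

# Example (password is a placeholder): append the entry for the Target
# cluster and keep the file private, since libpq ignores a .pgpass file
# with looser permissions.
# pgpass_entry 10.1.0.248 6432 gpadmin 'CHANGE_ME' >> ~/.pgpass
# chmod 600 ~/.pgpass
```

The append and chmod lines are left commented out so the sketch does not touch a real .pgpass file until you have filled in the actual password.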
gpcopy --full \
  --source-host mdw \
  --source-port 6432 \
  --dest-host 10.1.0.248 \
  --dest-port 6432 \
  --drop
[INFO]:-Total elapsed time: 15m10.61753093s
[INFO]:-Total transferred data 1.2TB, transfer rate 9.2TB/h
[INFO]:-Copied 2 databases
[INFO]:-Database gpperfmon: successfully copied 34 tables, skipped 0 tables, failed 0 tables
[INFO]:-Database dev: successfully copied 187 tables, skipped 0 tables, failed 0 tables
[INFO]:-Copy completed successfully
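Once the copy reports success, the last configuration step is to refresh statistics on the Target cluster with analyzedb. The following is a rough sketch of that step, not the exact commands I ran. The run and analyze_copied helpers and the DRY_RUN switch are hypothetical, and analyzing only the dev database (leaving out gpperfmon) is an assumption. By default the script only prints the command it would execute.

```shell
# Sketch only: prints the commands unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Refresh optimizer statistics for each copied database passed as an argument.
# -d selects the database, -a skips the confirmation prompt.
analyze_copied() {
  for db in "$@"; do
    run analyzedb -d "$db" -a
  done
}

analyze_copied dev
```

Run it on the Target cluster as gpadmin with DRY_RUN=0 to execute analyzedb for real.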
The Source cluster ran one of the TPC-H queries in 1 minute and 15 seconds. The same query ran in 25 seconds on the new cluster. As another example, a query that took 1 minute and 59 seconds on the Source cluster now runs in 33 seconds.
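To put the two timings above in the same terms, here is a small sketch that computes the speedup as old runtime divided by new runtime, using the numbers from this test converted to seconds. The speedup function name is only for illustration.

```shell
# Speedup = old runtime / new runtime, both in seconds.
speedup() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1fx\n", old / new }'
}

speedup 75 25    # 1m15s -> 25s, prints 3.0x
speedup 119 33   # 1m59s -> 33s, prints 3.6x
```

These are two individual queries from one test, so treat the ratios as an indication rather than a benchmark result.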
Benefits of gpcopy
- Validation that the new cluster performs as expected.
- Automatic Rollback plan. If I don’t like it, I delete the new cluster and keep using the old one.
- Secure. The data transfers between VMs are routed through your private network.
The elasticity of the cloud provides new solutions to old problems. How can I scale Greenplum? The old way is to add more nodes and run gpexpand. With the cloud, there are three answers.
- Scale compute with gpcompute
- Scale storage with gpgrow
- Scale nodes with gpcopy