Disk Games

I’ve been working in the cloud for a while now and have learned a lot about optimizing disk performance. This quick blog post covers some of the things I’ve learned.

Blockdev

On-premises installations are typically up 24×7, but in the cloud you may frequently want to pause your cluster to save money. Greenplum on the cloud has gppower, which automates the pause and resume of your cluster so you won’t have to mess with the blockdev command.

The installation guide for Greenplum will advise you to set the read-ahead value to 16384 for each data volume in your cluster.

# read-ahead is set in 512-byte sectors, so 16384 is 8MB
/sbin/blockdev --setra 16384 $device

Did you know that on reboot, your operating system resets this back to the default? So you will need to set the read-ahead after every reboot. This is done automatically for you in the cloud, but if you build your own cluster, be sure to add some automation that runs after the nodes start, as shown below.
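
As a minimal sketch, a boot-time script can re-apply the setting to each data device. The device names here are assumptions; substitute your own:

# run at boot, e.g. from /etc/rc.d/rc.local or a systemd oneshot unit
# assumes the data volumes are /dev/sdb, /dev/sdc, and /dev/sdd
for device in /dev/sdb /dev/sdc /dev/sdd; do
    /sbin/blockdev --setra 16384 $device
done

You can confirm the current value at any time with /sbin/blockdev --getra $device.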

UUID

In your /etc/fstab file, you can define the mounts like so:

/dev/sdb /data1 xfs rw,noatime,nobarrier,nodev,inode64 0 2
/dev/sdc /data2 xfs rw,noatime,nobarrier,nodev,inode64 0 2
/dev/sdd /data3 xfs rw,noatime,nobarrier,nodev,inode64 0 2

Unfortunately, you may run into problems with this configuration. The device order may change on reboot, so suddenly the disk that is supposed to be data1 is now data2. This will prevent the database from starting properly.

Instead, you should use the UUID, which can be found in /dev/disk/by-uuid/ on each host.
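
You can look up the UUID of each device with blkid or by listing the by-uuid symlinks. The device name below is just an example:

# print only the UUID for a given device
blkid -s UUID -o value /dev/sdb

# or list every device by UUID
ls -l /dev/disk/by-uuid/

With those UUIDs in hand, the fstab entries become: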

UUID=46c6e872-20ca-4d50-9ed9-d98c4f662437 /data1 xfs rw,noatime,nobarrier,nodev,inode64 0 2
UUID=ed8b5c09-7427-455c-b132-f601d4f0b0d2 /data2 xfs rw,noatime,nobarrier,nodev,inode64 0 2
UUID=e92a5f17-3903-4c14-8e22-ebe218ab403f /data3 xfs rw,noatime,nobarrier,nodev,inode64 0 2

With UUIDs, the device path may change but the UUID persists, so the mounts stay correct. This ensures the database starts without problems after the cluster is restarted.

noatime

This simple setting, found in the /etc/fstab file, tells the operating system not to update a file’s access timestamp (atime) every time the file is read. Skipping those metadata writes removes considerable overhead for the database. In other words, it is faster to use noatime.
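
If a volume is already mounted without it, you can apply noatime on the fly with a remount; the mount point here is just an example:

# re-mount /data1 with noatime, no unmount required
mount -o remount,noatime /data1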

nobarrier

This mount option, available in CentOS/Red Hat 6 and 7 (it has been removed from newer kernels), improves performance by not using write barriers. The overall performance gain depends on your disk configuration, but be advised that the disks need a battery-backed write cache to ensure no data loss.
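
To confirm which options a filesystem was actually mounted with, findmnt works well. Again, the mount point is an example:

# show the active mount options for /data1
findmnt -n -o OPTIONS /data1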

xfs

XFS has been the recommended filesystem for Greenplum for over 10 years, but did you know about the cool xfs_growfs utility that is included with it? This nifty tool grows an existing XFS filesystem while it is mounted, so you don’t have to stop the database to grow your storage. It is what the Greenplum cloud utility gpgrow uses to grow the disk storage for your database in the cloud.
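
Usage is a one-liner. After enlarging the underlying cloud volume, point xfs_growfs at the mount point (the path here is an example):

# expand the mounted xfs filesystem to fill the resized volume
xfs_growfs /data1

# confirm the new size
df -h /data1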

Scheduler

The disk scheduler recommended for Greenplum is deadline. If you ever replace a disk in a Greenplum cluster, be sure to set the scheduler back to deadline; the default is likely cfq, and it needs to be changed.

In the cloud, the gpsnap utility actually replaces your data volumes during a restore and sets the scheduler properly for you, but if you restore a disk manually, be sure to set this for every disk you replaced.

echo "deadline" > /sys/block/$devname/queue/scheduler

Number of Disks

Typically, adding more disks to a node gives you greater throughput, but you will eventually reach the limit of the controller, and at that point adding more disks won’t mean more throughput.

For the cloud, things get a little bit more interesting. Each cloud vendor puts performance limits on each VM and compared to a bare metal installation, these limits are considerably lower.

You may read that a cloud vendor’s disk can provide 500MB/s of read performance and think that with 4 disks you could achieve 2000MB/s. However, that same cloud vendor may put a speed limit on each VM of only 500MB/s. So no matter how many disks you add to your cluster, you won’t get more throughput!

It is often better to use 2x to 8x as many smaller virtual machines to get more overall disk throughput rather than using the larger instance types in the clouds. BTW, all of this work has already been figured out for the Greenplum cloud products. 🙂
