I recently spun up a 12-node Cassandra cluster in EC2 and, since it's a database, I decided to do some basic tire-kicking and learned a few things along the way.
Rule: always zero your ephemerals if you care about performance.
Why: Amazon is likely using sparse files to back ephemerals (and probably EBS, I have no experience there). This makes perfect sense, because:
- you get free thin provisioning, so unused disk doesn't go to waste
- Xen supports it well
- it's easy to manage lots & lots of them
- it's trivial to export over all common network block protocols (e.g. AoE, iSCSI)
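Because the host has to allocate a backing block in the sparse file the first time each block in your VM is written, performance will be all over the map while zeroing the disks. Once every block has been written, that allocation cost has been paid and later writes should not hit it again.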
I usually launch my zeroing script with cl-run.pl --list burnin -s zero-drives.sh. The "burnin" list is just all the EC2 hostnames, one per line, in ~/.dsh/machines.burnin.
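The contents of zero-drives.sh aren't shown here, so the following is only a rough sketch of what a zeroing pass does, written in Python rather than shell. The function name zero_fill, the 1 MiB block size, and the max_bytes cap are all assumptions for illustration; pointing it at a real device destroys whatever is on it.

```python
import errno
import os
import time


def zero_fill(path, block_size=1024 * 1024, max_bytes=None):
    """Write zeros to an existing file or block device at `path`.

    Stops when the target is full (ENOSPC) or, if `max_bytes` is given,
    once that many bytes have been written. Returns a tuple of
    (bytes_written, elapsed_seconds). Hypothetical helper, not the
    original zero-drives.sh.
    """
    block = bytes(block_size)  # a buffer of zero bytes
    written = 0
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY)
    try:
        while max_bytes is None or written < max_bytes:
            # Trim the final chunk so we never exceed max_bytes.
            chunk = block if max_bytes is None else block[: max_bytes - written]
            try:
                n = os.write(fd, chunk)
            except OSError as exc:
                if exc.errno == errno.ENOSPC:
                    break  # reached the end of the device
                raise
            if n == 0:
                break
            written += n
        os.fsync(fd)  # make sure the zeros actually reach the disk
    finally:
        os.close(fd)
    return written, time.monotonic() - start
```

Dividing the returned byte count by the elapsed seconds gives a crude throughput figure per disk, which is the same kind of number the culling rounds below rely on.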
Culling round 1: Look at the raw throughput of all of the nodes and cull anything that looks abnormally low. For example, when building the aforementioned cluster, I kept getting obviously bad instances in one of the us-east-1 AZs. This is what I saw when using my cluster netstat tool for a batch of m1.xlarge instances in us-east-1c.
I immediately culled everything doing under 10k IOPS for more than a minute. If you examine the per-disk stats with iostat -x 2, you'll usually see one disk with insanely high (>1000ms) latency all the time. There are certainly false negatives at this phase, but I don't really care since instances are cheap and time is not. I ended up starting around 30 instances in that one troublesome AZ to find 3 with sustainable IOPS in the most trivial of tests (dd).
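The "under 10k IOPS for more than a minute" rule is easy to state in code. This is a minimal sketch, assuming you already have per-instance samples as (timestamp in seconds, IOPS) pairs in time order; the function name should_cull and the sample format are hypothetical, and nothing here parses real cl-netstat.pl or iostat output.

```python
def should_cull(samples, min_iops=10_000, max_low_seconds=60):
    """Return True if IOPS stayed below `min_iops` for a continuous
    stretch longer than `max_low_seconds`.

    `samples` is an iterable of (timestamp_seconds, iops) pairs sorted
    by timestamp. A single sample at or above the threshold resets the
    clock, so brief dips do not get an instance culled.
    """
    low_since = None  # timestamp at which the current low streak began
    for ts, iops in samples:
        if iops < min_iops:
            if low_since is None:
                low_since = ts
            if ts - low_since > max_low_seconds:
                return True
        else:
            low_since = None
    return False
```

Requiring a continuous low stretch, rather than a low average, is a design choice on my part: it matches the "for more than a minute" wording and tolerates the short dips you see while backing blocks are being allocated.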
When I think I have enough obviously tolerable nodes for a race, I kick off another zero round. Once the load levels out a little, I take a snapshot I like of the cl-netstat.pl output and process it in a hacky way: sort by IOPS, then add each instance's EC2 zone and instance ID so I can kill the losers without digging around. Here's an example from a round of testing I did for a recent MySQL cluster deployment:
I picked the top few instances from each AZ and terminated the rest. Job done.
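The ranking step could be sketched like this, assuming each node has already been reduced to a record with an instance ID, an availability zone, and an IOPS figure. The function name pick_survivors and the dictionary keys are hypothetical; the hacky processing I actually used is not reproduced here.

```python
from collections import defaultdict


def pick_survivors(nodes, keep_per_zone):
    """Split nodes into (keep, terminate) lists.

    `nodes` is a list of dicts with the keys "instance_id", "zone" and
    "iops". Within each zone, nodes are ranked by IOPS from highest to
    lowest; the top `keep_per_zone` are kept and the rest are marked
    for termination. Zones are processed in alphabetical order.
    """
    by_zone = defaultdict(list)
    for node in nodes:
        by_zone[node["zone"]].append(node)

    keep, terminate = [], []
    for zone in sorted(by_zone):
        ranked = sorted(by_zone[zone], key=lambda n: n["iops"], reverse=True)
        keep.extend(ranked[:keep_per_zone])
        terminate.extend(ranked[keep_per_zone:])
    return keep, terminate
```

The instance IDs in the terminate list are what you would hand to whatever EC2 tooling you use to kill instances; that part is deliberately left out.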
This is a pretty crude process in many ways. It's very manual, it requires a lot of human judgement, and most importantly, dd if=/dev/zero is not a good measure of real-world performance. This process is just barely good enough to cull the worst offenders in EC2, which seem to be quite common in my recent experience.
In the future, I will likely automate most of this burn-in process and add some real-world I/O generation, probably using real data.