Has anyone been running Kafka well on st1 EBS volumes? We've historically run our Kafka workload on m1 and m2 instance types, but we wanted to move to M4s to get better price/performance.
We rolled out a single M4 instance with 1 TB of st1 in each of two environments. Everything seemed to be going well: lower CPU utilization, good flush times, etc. Then, a few hours later, both our Kafka flush times and CPU wait times rose well above those of the rest of the cluster and stayed there.

The CloudWatch metrics show that our "burst balance" had been slowly draining over those hours, and the elevated times began as soon as it was exhausted. I could believe that the production workload was overwhelming some allocation, but in our staging environment it makes no sense.

CloudWatch says our average write size is 40 KB/op, which I suspect is far too small for st1, since it's designed for large sequential writes. I believe we're eating up our IOPS allocation, but I could be looking at completely the wrong thing. We're using an XFS filesystem with a 16 M allocsize.

Does anyone have experience with this volume type? I suspect we're just holding it wrong.

Cheers,

-Dave

--
Dave Mangot
Director of Operations
Librato and Papertrail
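
For anyone who wants to check the arithmetic behind the small-write suspicion, here is a rough sketch. It rests on assumptions that are not measurements from our cluster: that st1 refills burst credit at about 40 MiB/s per TiB, that the bucket holds about 1 TiB of credit per TiB provisioned, and that a write which is not merged with its neighbours may be charged as a full 1 MiB. Please verify those figures against the current AWS documentation. The 100 ops/s rate is purely hypothetical, the function name is made up for illustration, and the 1 TB volume is treated as 1 TiB for simplicity.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

# Assumed st1 figures; verify against current AWS documentation.
BASELINE_MIB_S_PER_TIB = 40   # assumed credit refill rate per TiB provisioned
BUCKET_TIB_PER_TIB = 1        # assumed burst bucket capacity per TiB provisioned
CHARGE_UNIT_BYTES = MIB       # assumed charge for one unmerged I/O


def hours_until_bucket_empty(volume_tib, ops_per_sec, avg_io_bytes,
                             charged_as_full_unit):
    """Estimate hours until a full burst bucket drains (hypothetical helper).

    Returns None when the workload spends credit no faster than it refills.
    """
    refill_mib_s = BASELINE_MIB_S_PER_TIB * volume_tib
    # Either every op costs a full charge unit, or only the bytes it wrote.
    bytes_charged = CHARGE_UNIT_BYTES if charged_as_full_unit else avg_io_bytes
    spend_mib_s = ops_per_sec * bytes_charged / MIB
    net_drain_mib_s = spend_mib_s - refill_mib_s
    if net_drain_mib_s <= 0:
        return None
    bucket_mib = BUCKET_TIB_PER_TIB * volume_tib * TIB / MIB
    return bucket_mib / net_drain_mib_s / 3600


if __name__ == "__main__":
    # Hypothetical 100 writes/s at the 40 KB/op average CloudWatch reports.
    for full_unit in (False, True):
        hours = hours_until_bucket_empty(1, 100, 40 * 1024, full_unit)
        print("charged as 1 MiB each:", full_unit, "-> hours to empty:", hours)
```

Under those assumptions, 100 small writes per second would never drain the bucket if charged by bytes written, but would empty it in a matter of hours if each write were charged as a full 1 MiB. That is the same order of magnitude as what we saw, though it proves nothing by itself about our actual workload.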