Hi,

We have a Cassandra cluster of 28 nodes. Each one is an EC2 m1.xlarge instance based on the DataStax AMI, with 4 storage volumes in RAID 0.
Here is the ticket we opened with Amazon support:

"This RAID is created using the DataStax public AMI: ami-b2212dc6. Sources are also available here: https://github.com/riptano/ComboAMI

As you can see in the attached screenshot ( http://imageshack.com/a/img854/4592/xbqc.jpg ), randomly but frequently one of the storage volumes becomes fully used (100%) while the 3 others stay at low use. Because of this, the node becomes slow and the whole Cassandra cluster is impacted. We are losing data due to failed writes, and availability for our customers is affected. The node was in this state for one hour before we decided to restart it. We have already removed 3 other instances because of this same issue."

(see other screenshots)
http://imageshack.com/a/img824/2391/s7q3.jpg
http://imageshack.com/a/img10/556/zzk8.jpg

Amazon support took a close look at the instance as well as its underlying hardware for any potential health issues, and both seem to be healthy.

Has anyone already experienced something like this? Or would it be better to contact the AMI author?

Thanks a lot,
Philippe.