I have Cassandra nodes with long uptime. The disk footprint of older Cassandra data changes when I copy it to a different folder. Why is that? I have used both rsync and cp. This is very confusing when doing maintenance tasks such as a hardware upgrade on EC2 or backing up a snapshot.
I am talking about a difference of as much as 100% for 25-40 GB of data: after copying, it grows to double that size. The server's data folder is on EC2 magnetic instance store, and I copied it to various EBS volumes. I do not think it is something peculiar to EC2: when I copied the EBS data back to magnetic instance store, the size remained the same. So I am guessing there is some kind of Cassandra compression magic that is fooling operating system tools like du and df. There is a similar issue with the commitlog folder, but the total size of that folder is not as big and the percentage difference is small. Thanks for any insight you can share. k.z.
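To pin down which number actually changes between the original folder and the copy, a small script can compare three totals for a directory tree: bytes summed once per path (roughly what a plain copy writes), bytes summed once per inode (hard-linked files counted a single time), and bytes actually allocated on disk. This is only a diagnostic sketch. The function name `footprint` and the default path are hypothetical, and whether hard links or sparse files explain the difference here is an assumption to test, not an established cause.

```python
import os
import stat
import sys


def footprint(root):
    """Return (per_path_bytes, unique_inode_bytes, allocated_bytes) for a tree."""
    per_path = 0   # st_size summed once per path name
    unique = 0     # st_size summed once per inode (hard links counted once)
    allocated = 0  # disk blocks actually allocated, once per inode
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other non-regular files
            per_path += st.st_size
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another name for an inode that was already counted
            seen.add(key)
            unique += st.st_size
            allocated += st.st_blocks * 512  # st_blocks is in 512-byte units
    return per_path, unique, allocated


if __name__ == "__main__":
    # Hypothetical default; pass the real data directory as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/cassandra/data"
    per_path, unique, allocated = footprint(root)
    print(f"per-path bytes:     {per_path}")
    print(f"unique-inode bytes: {unique}")
    print(f"allocated bytes:    {allocated}")
```

Running it on both the source folder and the copy should show which total diverges. If the per-path total is much larger than the unique-inode total on the source but the two match on the copy, the copy has turned hard links into separate files; if the allocated total is much smaller than the unique-inode total on the source only, the copy has filled in sparse regions.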