Cassandra currently has a very high constant per-row overhead of around 40 bytes. Additionally, there are around 12 bytes of overhead per column. Finally, column names are repeated for each row.
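
To get a feel for how these overheads add up, here is a rough back-of-the-envelope sketch. It assumes the approximate figures above (about 40 bytes per row, about 12 bytes per column, and the column name stored again in every row). The function name, its parameters, and the sample sizes are made up for illustration, and it ignores row keys, indexes, replication, and compaction leftovers, so treat the result as a ballpark only.

```python
def estimate_on_disk_bytes(rows, columns_per_row, name_bytes, value_bytes,
                           row_overhead=40, column_overhead=12):
    """Rough on-disk size estimate; the overhead defaults are approximate."""
    # Each column pays a fixed overhead plus its name (repeated in every row)
    # plus its value.
    per_column = column_overhead + name_bytes + value_bytes
    # Each row pays a fixed overhead plus the cost of all of its columns.
    per_row = row_overhead + columns_per_row * per_column
    return rows * per_row


# Hypothetical workload: 1000 rows, 10 columns each,
# 8-byte column names and 10-byte values.
rows, cols, name_len, value_len = 1000, 10, 8, 10
on_disk = estimate_on_disk_bytes(rows, cols, name_len, value_len)
raw_values = rows * cols * value_len

print(on_disk)               # 340000
print(on_disk / raw_values)  # 3.4
```

With small values and many columns, the fixed overheads and repeated names can easily outweigh the data itself, which is the kind of blow-up described in the original question.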
CASSANDRA-674 and CASSANDRA-1207 will help with these overheads, but they will not be fixed until 0.8. The file format change should bring lovely things like compression and variable length encoding, which Cassandra will gain huge benefits from.

But, "disk is cheap"... the solution for now is to add more nodes. And why not?

Thanks,
Stu

-----Original Message-----
From: "Julie" <julie.su...@nextcentury.com>
Sent: Friday, July 9, 2010 9:58am
To: user@cassandra.apache.org
Subject: Help! Cassandra disk space utilization WAY higher than I would expect

Hi guys,

I am on the hook to explain why 30GB of data is filling up 106GB of disk space since this is concerning information for my project. We are very excited about the possibility of using Cassandra but need to understand this anomaly in order to feel confident. Does anyone know why this could be happening?

cfstats reports that space used live is equal to space used total so I think the data is truly taking up 106GB, I just can't explain why.

Space used (live): 113946099884
Space used (total): 113946099884

Thank you for any guidance!
Julie