Cassandra people, I'm trying to measure Cassandra's disk usage after inserting some columns, in order to plan disk sizes and configurations for future deployments.
My approach is very straightforward:

1. clean_data: stop Cassandra and rm -rf /var/lib/cassandra/{data,commitlog,saved_caches}/*
2. perform_inserts
3. measure_disk_usage: nodetool flush && du -ch /var/lib/cassandra

There are two types of inserts:
- into a simple column: key, column name, and a random string of size 100 as the value
- into a super column: key, super column name, column name, and a random string of size 100 as the value

Surprisingly, when I insert 100 million columns into the simple column family, it uses more disk than the same number of columns in the super column family. How can that be possible? The simple columns take 41984 MB and the super columns 29696 MB; the difference is more than noticeable!

Somebody told me yesterday that super columns don't have a per-column timestamp, but in my case, even if all the data were under the same super column key, that would not explain the difference!

PS: sorry, English is not my first language.
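For reference, here is the measurement loop written out as a small script. It is only a sketch of the procedure above: the "cassandra" service name, the default /var/lib/cassandra layout, and the ./perform_inserts program (a placeholder for whatever client actually does the 100 million writes) are assumptions on my side.

    #!/bin/sh
    # Sketch of the measurement procedure (assumptions: a sysvinit
    # "cassandra" service, the default /var/lib/cassandra layout,
    # and a hypothetical ./perform_inserts client that does the writes).

    # 1. clean_data: stop the node and wipe data, commitlog and saved caches
    sudo service cassandra stop
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/*
    sudo service cassandra start

    # 2. perform_inserts: run the client that writes the test columns
    ./perform_inserts

    # 3. measure_disk_usage: flush memtables to SSTables, then sum disk usage
    nodetool flush
    du -ch /var/lib/cassandra

The nodetool flush before du matters because writes still sitting in memtables would otherwise not show up on disk at all.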
[Attachment: results.eps (PostScript document)]