I am chasing down a row size discrepancy and am confused. I populated a single-node Cassandra cluster with 10,000 rows of data, using numeric keys 1-10,000, where each row is a little over 100 kB in length and contains a single column.
When I run cfstats on the node immediately after writing the data, it reports that Compacted row minimum size = Compacted row maximum size, which is a little over 100,000 bytes. This is what I expect.

I then run an application that randomly reads rows and adds a timestamp column to each row it reads. The timestamp column name and value add only a few bytes to the row. But after running my reading app for a few hours, cfstats reports a very odd minimum row size (and compacted mean row size):

[r...@ec2-server1 ~]# /mnt/server/apache-cassandra-0.6.2/bin/nodetool -h ec2-server1 -p 8080 cfstats
Keyspace: Keyspace1
        Read Count: 670434
        Read Latency: 36.22349047035205 ms.
        Write Count: 1519933
        Write Latency: 0.02940705741634664 ms.
        Pending Tasks: 0
                Column Family: Standard1
                SSTable count: 6
                Space used (live): 11130225642
                Space used (total): 11130225642
                Memtable Columns Count: 1435
                Memtable Data Size: 40344907
                Memtable Switch Count: 1329
                Read Count: 670434
                Read Latency: 41.768 ms.
                Write Count: 1519933
                Write Latency: 0.025 ms.
                Pending Tasks: 0
                Key cache capacity: 200000
                Key cache size: 200000
                Key cache hit rate: 0.48049934471509675
                Row cache: disabled
                Compacted row minimum size: 238
                Compacted row maximum size: 100323
                Compacted row mean size: 67548

I thought I had a bug in my code, so I wrote another app to read every row in the database, keys 1-10,000. I compute the size of each row after reading it (by adding up the sizes of all column names and column values in the row, plus the size of the key string), and this matches what I expect: every single key in this table has a size of just over 100,000 bytes. (I know there is some per-row overhead, but I assume it would only make the row larger, not smaller.)

So I am confused about where cfstats is getting the row sizes it is working with. When I add the timestamp column to a row, I am not deleting the other (large) column in the row, but I am not rewriting the large column either.

Thanks for your help!

Julie
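
P.S. The per-row size check described above amounts to summing the key length with the lengths of every column name and value. Below is a minimal sketch of that arithmetic, assuming each row has already been fetched as a mapping of column names to values. The function name `row_size` and the sample row are illustrative only, not the actual test app, and the sketch ignores any on-disk overhead Cassandra adds.

```python
def row_size(key: str, columns: dict[bytes, bytes]) -> int:
    """Approximate logical row size: key bytes plus all column names and values."""
    size = len(key.encode("utf-8"))
    for name, value in columns.items():
        size += len(name) + len(value)
    return size


# Hypothetical row: one large ~100 kB column plus one small timestamp column.
example = {b"data": b"x" * 100_000, b"ts-1276000000": b"1"}
print(row_size("42", example))
```

Run over keys 1-10,000, a check like this gives the logical size of each whole row as a client sees it.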