On Thu, Aug 12, 2010 at 9:08 AM, Julie <julie.su...@nextcentury.com> wrote:
> I am chasing down a row size discrepancy and am confused.
>
> I populated a single node Cassandra cluster with 10,000 rows of data, using
> numeric keys 1-10,000, where each row is a little over 100kB in length and has
> a single column in it.
>
> When I perform a cfstats on the node immediately after writing the data, it
> reports that the Compacted row minimum size = Compacted row maximum size which
> is a little over 100,000 bytes. This is what I expect.
>
> I then run an application that randomly reads rows and adds a timestamp column
> to each row read. This timestamp column name and column value is just adding
> a few bytes to the row.
>
> But after running my reading app for a few hours, cfstats reports a very odd
> minimum row size (and compacted mean row size):
>
> [r...@ec2-server1 ~]# /mnt/server/apache-cassandra-0.6.2/bin/nodetool -h
> ec2-server1 -p 8080 cfstats
> Keyspace: Keyspace1
>   Read Count: 670434
>   Read Latency: 36.22349047035205 ms.
>   Write Count: 1519933
>   Write Latency: 0.02940705741634664 ms.
>   Pending Tasks: 0
>     Column Family: Standard1
>     SSTable count: 6
>     Space used (live): 11130225642
>     Space used (total): 11130225642
>     Memtable Columns Count: 1435
>     Memtable Data Size: 40344907
>     Memtable Switch Count: 1329
>     Read Count: 670434
>     Read Latency: 41.768 ms.
>     Write Count: 1519933
>     Write Latency: 0.025 ms.
>     Pending Tasks: 0
>     Key cache capacity: 200000
>     Key cache size: 200000
>     Key cache hit rate: 0.48049934471509675
>     Row cache: disabled
>     Compacted row minimum size: 238
>     Compacted row maximum size: 100323
>     Compacted row mean size: 67548
>
> I thought I had a bug in my code so I wrote another app to read every row
> in the database, keys 1-10,000. I get the size of each row after reading it
> (by adding up all column names and column values in the row and the size of
> the key string) and this matches what I expect -- every single key in this
> table has a size of just over 100,000 bytes. (I know there are some
> overhead columns in each row but I assume these will only make the row
> larger, not smaller.)
>
> So I am confused about where cfstats is getting the row sizes it is working
> with?
>
> When I add the timestamp column to each row, I am not deleting the other
> column (large) in the row but I am not rewriting the large column either.

I'm guessing (haven't read this part of the source) that the min size is being
generated in minor compaction, which doesn't see the whole row.

-ryan
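
To make the guess above concrete, here is a toy model in Python. It is not
Cassandra's code and I have not checked it against the source: the SSTable
layout, the fragment_size() and compact() helpers, and the column names are
all made up for illustration. It only assumes that a compaction computes row
sizes from whatever fragments of a row live in the files it is merging, and
that the small timestamp columns are flushed to separate files from the
original large column.

```python
from typing import Dict, List

Row = Dict[str, bytes]      # column name -> column value
SSTable = Dict[str, Row]    # row key -> fragment of that row stored in this file


def fragment_size(key: str, columns: Row) -> int:
    """Size of a row (or row fragment): key plus all column names and values."""
    return len(key) + sum(len(name) + len(value) for name, value in columns.items())


def compact(sstables: List[SSTable]) -> Dict[str, int]:
    """Merge the given files and report size stats for the merged rows only."""
    merged: SSTable = {}
    for table in sstables:
        for key, columns in table.items():
            merged.setdefault(key, {}).update(columns)
    sizes = [fragment_size(key, cols) for key, cols in merged.items()]
    return {"min": min(sizes), "max": max(sizes), "mean": sum(sizes) // len(sizes)}


# Initial load: one large column per row, all in one file (hypothetical data).
big = {str(k): {"data": b"x" * 100_000} for k in range(1, 4)}
# Later flushes: only the small timestamp columns written by the reader app.
ts1 = {"1": {"ts-0001": b"12816"}}
ts2 = {"1": {"ts-0002": b"12817"}, "2": {"ts-0003": b"12818"}}

minor = compact([ts1, ts2])        # subset of files: sees only small fragments
major = compact([big, ts1, ts2])   # every file: sees whole rows

print("minor:", minor)
print("major:", major)
```

In this model the merge over only the two small files reports a minimum of a
few bytes, while the merge over all three files reports a minimum just over
100,000 bytes, which is the same shape as the cfstats numbers in the original
message. Whether the real statistics are gathered this way is still only a
guess.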