Peter Schuller <peter.schuller <at> infidyne.com> writes:

> > a) cleanup is a superset of compaction, so if you've been doing
> > overwrites at all then it will reduce space used for that reason
>
Hi Peter and Jonathan,

In my test, I write 80,000 rows (100KB each row) to an 8 node cluster. The
80,000 rows all have unique keys '1' through '80000', so no overwriting is
occurring. I also don't do any deletes. I simply write the 80,000 rows to the
8 node cluster, which should be about 1GB of data times 3 (replication
factor=3) on each node.

The only thing I am doing special is that I use Random Partitioning and set
the InitialToken on each node to try to get the data evenly distributed:

# Create tokens for the RandomPartitioner that evenly divide the token space.
# The RandomPartitioner hashes keys into integer tokens in the range 0 to
# 2^127, so we simply divide that space into N equal sections.
# serverCount = the number of Cassandra nodes in the cluster
for ((ii=1; ii<=serverCount; ii++)); do
  host=ec2-server$ii
  echo "Generating InitialToken for server on $host"
  token=$(bc <<-EOF
($ii*(2^127))/$serverCount
EOF
)
  echo "host=$host initialToken=$token"
  echo "<InitialToken>$token</InitialToken>" >> storage-conf-node.xml
  cat storage-conf-node.xml
  ...
done

24 hours after my writes, the data is evenly distributed according to cfstats
(I see almost identical numRows from node to node), but there is a lot of
extra disk space being used on some nodes, again according to cfstats. This
disk usage drops back down to 2.7GB (exactly what I expect, since that's how
much raw data is on each node) when I run "nodetool cleanup". I am confused
why there is anything to clean up 24 hours after my last write.

All nodes in the cluster are fully up and aware of each other before I begin
the writes. The only other thing that could possibly be considered unusual is
that I cycle through all 8 nodes rather than communicating with a single
Cassandra node, and that I use a write consistency setting of ALL. I can't
see how these would increase the amount of disk space used, but I'm
mentioning it just in case.

Any help would be greatly appreciated,
Julie

Peter Schuller <peter.schuller <at> infidyne.com> writes:

> > a) cleanup is a superset of compaction, so if you've been doing
> > overwrites at all then it will reduce space used for that reason
>
> I had failed to consider over-writes as a possible culprit (since
> removals were stated not to be done). However, thinking about it, I
> believe the effect of this should be limited to roughly a doubling of
> disk space in the absolute worst case of over-writing all data in the
> worst possible order (such as writing everything twice in the same
> order).
>
> Or more accurately, it should be limited to wasting as much space as
> the size of the overwritten values. If you're overwriting with larger
> values, it will no longer be a "doubling" relative to the actual live
> data set.
>
> Julie, did you do over-writes, or were your disk space measurements
> based on the state of the cluster after an initial set of writes of
> unique values?
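
For anyone following along, here is a minimal sketch of the arithmetic behind
Julie's setup: the token spacing produced by the bc loop above, and the raw
data expected per node. The numbers (8 nodes, 80,000 rows of 100KB each,
replication factor 3) come from the messages in this thread; the snippet
itself is only illustrative, not the script actually used in the test.

#!/bin/bash
# Illustrative only: token spacing and expected raw data per node.
serverCount=8
for ((ii=1; ii<=serverCount; ii++)); do
  echo "ec2-server$ii initial token: $(echo "($ii*(2^127))/$serverCount" | bc)"
done
# 80,000 rows * 100KB each * replication factor 3, spread across 8 nodes:
bytesPerNode=$(echo "80000*100*1024*3/8" | bc)
echo "expected raw data per node: $bytesPerNode bytes (roughly 3GB)"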
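
And a rough worked version of the worst case Peter describes: if every row
were overwritten once with a value of the same size, the obsolete copies could
occupy about as much disk as the live data until a compaction (or cleanup,
which Peter notes is a superset of compaction) merges them away. Again, just a
sketch using the numbers from this thread.

# Illustrative only: worst-case extra space from over-writes, per Peter's reasoning.
liveBytes=$(echo "80000*100*1024*3/8" | bc)   # live data per node, roughly 3GB
overwrittenBytes=$liveBytes                   # every row overwritten once, same size
echo "per node, on disk before compaction: up to $(echo "$liveBytes+$overwrittenBytes" | bc) bytes (~2x live data)"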