did you try compact instead of cleanup, anyway?

On Tue, Jul 27, 2010 at 1:08 PM, Julie <julie.su...@nextcentury.com> wrote:
> Peter Schuller <peter.schuller <at> infidyne.com> writes:
>
>> > a) cleanup is a superset of compaction, so if you've been doing
>> > overwrites at all then it will reduce space used for that reason
>>
>
> Hi Peter and Jonathan,
>
> In my test, I write 80,000 rows (100KB each row) to an 8 node cluster. The
> 80,000 rows all have unique keys '1' through '80000', so no overwriting is
> occurring. I also don't do any deletes. I simply write the 80,000 rows to
> the 8 node cluster, which should be about 1GB of data times 3 (replication
> factor=3) on each node.
>
> The only thing I am doing that is special is using the RandomPartitioner and
> setting the InitialToken on each node to try to get the data evenly
> distributed:
>
> # Create tokens for the RandomPartitioner that evenly divide token space.
> # The RandomPartitioner hashes keys into integer tokens in the range 0 to
> # 2^127, so we simply divide that space into N equal sections.
> # serverCount = the number of Cassandra nodes in the cluster
>
> for ((ii=1; ii<=serverCount; ii++)); do
>     host=ec2-server$ii
>     echo Generating InitialToken for server on $host
>     token=$(bc<<-EOF
>         ($ii*(2^127))/$serverCount
>     EOF)
>     echo host=$host initialToken=$token
>     echo "<InitialToken>$token</InitialToken>" >> storage-conf-node.xml
>     cat storage-conf-node.xml
>     ...
> done
>
> 24 hours after my writes, the data is evenly distributed according to
> cfstats (I see almost identical numRows from node to node), but there is
> a lot of extra disk space being used on some nodes, again according to
> cfstats. This disk usage drops back down to 2.7GB (exactly what I expect,
> since that's how much raw data is on each node) when I run "nodetool
> cleanup".
>
> I am confused about why there is anything to clean up 24 hours after my
> last write. All nodes in the cluster are fully up and aware of each other
> before I begin the writes. The only other thing that could possibly be
> considered unusual is that I cycle through all 8 nodes, rather than
> communicating with a single Cassandra node, and I use a write consistency
> setting of ALL. I can't see how these would increase the amount of disk
> space used, but I mention them just in case.
>
> Any help would be greatly appreciated,
> Julie
>
> Peter Schuller <peter.schuller <at> infidyne.com> writes:
>
>> > a) cleanup is a superset of compaction, so if you've been doing
>> > overwrites at all then it will reduce space used for that reason
>>
>> I had failed to consider over-writes as a possible culprit (since
>> removals were stated not to be done). However, thinking about it, I
>> believe the effect of this should be limited to roughly a doubling of
>> disk space in the absolute worst case of over-writing all data in the
>> worst possible order (such as writing everything twice in the same
>> order).
>>
>> Or more accurately, it should be limited to wasting as much space as
>> the size of the overwritten values. If you're overwriting with larger
>> values, it will no longer be a "doubling" relative to the actual live
>> data set.
>>
>> Julie, did you do over-writes, or were your disk space measurements
>> based on the state of the cluster after an initial set of writes of
>> unique values?
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
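
For reference, a minimal self-contained version of the token calculation Julie
describes above (a sketch only, assuming bash and bc are available; the
ec2-server host names and the serverCount value are illustrative). It indexes
nodes from 0 rather than 1, following the common convention that the first
node gets token 0:

#!/usr/bin/env bash
# Sketch: divide the RandomPartitioner token space (0 .. 2^127) into
# serverCount equal sections, assigning token i * 2^127 / serverCount
# to the i-th node (i = 0 .. serverCount-1).
serverCount=8

for ((i=0; i<serverCount; i++)); do
    host=ec2-server$((i+1))                              # illustrative host name
    token=$(echo "($i * 2^127) / $serverCount" | bc)     # integer division in bc
    echo "host=$host initialToken=$token"
done

Each node's token would then go into that node's storage-conf.xml as
<InitialToken>...</InitialToken>, as in the quoted script.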