Thanks Sylvain - well, no, I don't really understand it at all.

We have it all: wide rows with small values up to a single larger column
per row. The problem hits every CF. RF = 3, reads and writes at QUORUM.

The CF that is killing me right now has a single column that is never
updated (it's WORM - updates are reinserts under a new key plus a delete
of the old one, to avoid in-place updates in a large CF). 250GB per node.

Unfortunately restarting the node doesn't stop the repair, so it started
again. I deleted all tmp files before restarting, but it's out of space
again. du -hcs shows 780GB for that CF now. I guess I have to restart all
nodes to stop the repair?

To answer your question: yes, the cluster might be a little out of sync,
but not that much.

What I don't understand: I saw that the repairing node was still doing a
validation compaction on that major sstable file (200GB), but it had
already received loads of data for that CF from the other nodes.

Sigh...

(A few rough sketches of the repair granularity and the disc space math
are at the very bottom of this mail, below the quotes.)

On May 23, 2011, at 7:48 PM, Sylvain Lebresne wrote:

> On Mon, May 23, 2011 at 7:17 PM, Daniel Doubleday
> <daniel.double...@gmx.net> wrote:
>> Hi all
>>
>> I'm a bit lost: I tried a repair yesterday with only one CF and that
>> didn't really work the way I expected, but I thought that was a bug
>> which only affects that special case.
>>
>> So I tried again for all CFs.
>>
>> I started with a nicely compacted machine with around 320GB of load.
>> Total disc space on this node was 1.1TB.
>>
>> After it ran out of disc space (meaning I received around 700GB of
>> data) I had a very brief look at the repair code again, and it seems
>> to me that the repairing node will get all data for its range from
>> all its neighbors.
>
> The repaired node is supposed to get only data from its neighbors for
> rows it is not in sync with. That is all supposed to depend on how much
> the node is out of sync compared to the other nodes.
>
> Now there are a number of things that could make it repair more than
> what you would hope. For instance:
> 1) Even if only one column is different for a row, the full row is
> repaired. If you have a small number of huge rows, that can amount to
> quite a bit of data uselessly transferred.
> 2) The other one is that the merkle tree (which allows us to say
> whether 2 rows are in sync) doesn't necessarily have one hash per row,
> so in theory one column not in sync may imply the repair of more than
> one row.
> 3) https://issues.apache.org/jira/browse/CASSANDRA-2324 (which is
> fixed in 0.8)
>
> Fortunately, the chance of getting hit by 1) is inversely proportional
> to the chance of getting hit by 2), and vice versa.
>
> Anyway, the kind of excess data you're seeing is not something I would
> expect unless the node is really completely out of sync with all the
> other nodes.
> So in light of this, do you have more info on your own case?
> (Do you have lots of small rows, or a few large ones? Did you expect
> the node to be widely out of sync with the other nodes? Etc.)
>
>
> --
> Sylvain
>
>>
>> Is that true, and if so is it the intended behavior? If so, one would
>> rather need 5-6 times the disc space, given that the compactions that
>> need to run after the sstable rebuild also need temp disc space.
>>
>> Cheers,
>> Daniel
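----

PS: a few rough sketches below, in case they help frame things. They are
simplifications under stated assumptions, not Cassandra code.

First, the write pattern for the WORM CF I mentioned, to make clear that
nothing is ever updated in place - a new value goes in under a fresh key
and the old key is deleted. The "client" here is just a placeholder with
insert/remove methods, not any particular driver API:

import uuid

def rewrite_blob(client, cf, old_key, value):
    # Replace a WORM value: write it under a brand new key, then delete the old one.
    new_key = uuid.uuid4().hex                   # fresh row key for the new version
    client.insert(cf, new_key, {"data": value})  # each key is only ever written once
    client.remove(cf, old_key)                   # tombstone the old key instead of overwriting it
    return new_key                               # callers repoint their lookup to new_key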
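On Sylvain's points 1) and 2): a toy illustration of why a merkle tree
that hashes whole rows (and possibly several rows per leaf) ends up
streaming much more than what is actually out of sync. The leaf
granularity and hashing below are made up for the example; this is not
the real AntiEntropyService tree:

import hashlib

ROWS_PER_LEAF = 4  # made-up granularity; the real tree's resolution depends on the range

def leaf_hashes(rows):
    # rows: dict of row_key -> dict of column_name -> value
    keys = sorted(rows)
    leaves = []
    for i in range(0, len(keys), ROWS_PER_LEAF):
        chunk = keys[i:i + ROWS_PER_LEAF]
        h = hashlib.md5()
        for k in chunk:
            for col, val in sorted(rows[k].items()):
                h.update(("%s:%s:%s" % (k, col, val)).encode())
        leaves.append((chunk, h.hexdigest()))
    return leaves

def rows_to_stream(local, remote):
    # Every row that shares a leaf with a divergent row gets streamed in full.
    out = []
    for (chunk, h_local), (_, h_remote) in zip(leaf_hashes(local), leaf_hashes(remote)):
        if h_local != h_remote:
            out.extend(chunk)
    return out

# One column of one row differs, but the whole 4-row range under that leaf is repaired.
local = dict(("row%d" % i, {"col": "v"}) for i in range(8))
remote = dict(local)
remote["row3"] = {"col": "DIFFERENT"}
print(rows_to_stream(local, remote))   # ['row0', 'row1', 'row2', 'row3']

With a handful of huge rows a leaf is basically one row and you pay for
1); with many small rows each leaf covers lots of rows and you pay for
2) - which is, as I read it, Sylvain's "inversely proportional" remark.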
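And the disc space question from my original mail, with the numbers from
this node (320GB load, 1.1TB disc, RF = 3). The "everything gets streamed"
worst case and the "compaction needs about as much temp space again"
factor are assumptions, not measurements:

load_gb = 320      # compacted load before the repair
neighbours = 2     # RF = 3, so two other replicas hold this node's ranges

streamed_gb = neighbours * load_gb   # worst case: both neighbours stream their full copy
on_disk_gb = load_gb + streamed_gb   # sstables on disk before compaction cleans them up
compaction_tmp_gb = on_disk_gb       # a major compaction can need roughly that much again

print(on_disk_gb)                                   # 960 -> already tight on a 1.1TB disc
print((on_disk_gb + compaction_tmp_gb) / load_gb)   # 6.0 -> the "5-6 times the disc space" figure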