Thanks Sylvain

Well, no, I don't really understand it at all. We have a bit of everything: 
wide rows with lots of small values all the way to rows with a single larger 
column.

The problem hits every CF. RF = 3, reads and writes at QUORUM. 

The CF that is killing me right now has a single column that is never updated 
(it's WORM - updates are re-inserts under a new key plus a delete of the old 
row, to avoid in-place updates in that large CF). 250GB per node.
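
Roughly what that pattern looks like - just a toy Python sketch against an 
in-memory stand-in for the CF; the names are made up, not our real code:

import uuid

blob_cf = {}  # stands in for the real column family

def insert_blob(data):
    key = uuid.uuid4().hex          # new row key for every version
    blob_cf[key] = {"data": data}   # single column, never updated in place
    return key

def update_blob(old_key, new_data):
    new_key = insert_blob(new_data)  # write the new version first...
    blob_cf.pop(old_key, None)       # ...then delete the old row
    return new_key

k1 = insert_blob(b"version 1")
k2 = update_blob(k1, b"version 2")
assert k1 not in blob_cf and blob_cf[k2]["data"] == b"version 2"
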
Unfortunately restarting the node doesn't stop the repair, so it started 
again. I deleted all tmp files before restarting, but it's out of space again: 
du -hcs shows 780GB for that CF now.
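
For what it's worth, this is roughly how I'm splitting live vs. tmp sstable 
sizes - a quick Python sketch; the data directory path is just an example, and 
it assumes tmp sstables have "-tmp-" in the filename:

import os

data_dir = "/var/lib/cassandra/data/MyKeyspace"  # example path, adjust

live, tmp = 0, 0
for root, _dirs, files in os.walk(data_dir):
    for name in files:
        size = os.path.getsize(os.path.join(root, name))
        if "-tmp-" in name:   # temporary sstables from compactions / streaming
            tmp += size
        else:
            live += size

print("live: %.1f GB, tmp: %.1f GB" % (live / 1e9, tmp / 1e9))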

I guess I have to restart all nodes to stop the repair?

To answer your question: yes, the cluster might be a little out of sync, but 
not that much. 

What I don't understand: I saw that the repairing node was still doing a 
validation compaction on that major sstable (200GB), yet it had already 
received loads of data for that CF from the other nodes.
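
If I follow your point 2) below, the validation compaction is what builds the 
merkle tree, and a coarse tree can flag a whole range of rows because of one 
differing column. Something like this toy sketch (just the idea, not the real 
tree code):

import hashlib

def leaf_hash(rows):
    # one hash over all rows that fall into a leaf's range
    h = hashlib.md5()
    for key in sorted(rows):
        h.update(key.encode())
        h.update(repr(sorted(rows[key].items())).encode())
    return h.hexdigest()

def ranges_to_repair(local, remote, rows_per_leaf=2):
    keys = sorted(set(local) | set(remote))
    out_of_sync = []
    for i in range(0, len(keys), rows_per_leaf):
        rng = keys[i:i + rows_per_leaf]
        a = leaf_hash({k: local.get(k, {}) for k in rng})
        b = leaf_hash({k: remote.get(k, {}) for k in rng})
        if a != b:
            out_of_sync.append(rng)  # the whole range gets streamed
    return out_of_sync

local  = {"r1": {"c": 1}, "r2": {"c": 2}, "r3": {"c": 3}, "r4": {"c": 4}}
remote = dict(local, r2={"c": 999})      # one column differs in one row...
print(ranges_to_repair(local, remote))   # ...but all of ['r1', 'r2'] is flagged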

Sigh...


On May 23, 2011, at 7:48 PM, Sylvain Lebresne wrote:

> On Mon, May 23, 2011 at 7:17 PM, Daniel Doubleday
> <daniel.double...@gmx.net> wrote:
>> Hi all
>> 
>> I'm a bit lost: I tried a repair yesterday with only one CF and it didn't 
>> really work the way I expected, but I thought that was a bug which only 
>> affects that special case.
>> 
>> So I tried again for all CFs.
>> 
>> I started with a nicely compacted machine with around 320GB of load. Total 
>> disc space on this node was 1.1TB.
>> 
>> After it ran out of disc space (meaning I received around 700GB of data) I 
>> had a very brief look at the repair code again, and it seems to me that the 
>> repairing node will get all data for its range from all of its neighbors.
> 
> The repaired node is supposed to get only the data from its
> neighbors for rows it is not in sync with. So how much is
> transferred is supposed to depend on how far the node is out
> of sync with the other nodes.
> 
> Now there are a number of things that could make it repair more
> than what you would hope. For instance:
>  1) even if only one column differs in a row, the full row is
>      repaired. If you have a small number of huge rows, that
>      can amount to quite a lot of data uselessly transferred.
>  2) The other one is that the merkle tree (which is used to tell
>      whether two rows are in sync) doesn't necessarily have one
>      hash per row, so in theory one column out of sync may imply
>      the repair of more than one row.
>  3) https://issues.apache.org/jira/browse/CASSANDRA-2324 (which
>      is fixed in 0.8)
> 
> Fortunately, the chance of getting hit by 1) is inversely
> proportional to the chance of getting hit by 2), and vice versa.
> 
> Anyway, the kind of excess data you're seeing is not something
> I would expect unless the node is really completely out of sync
> with all the other nodes.
> So in light of this, do you have more info on your own case?
> (Do you have lots of small rows, or a few large ones? Did you
> expect the node to be widely out of sync with the other nodes? Etc.)
> 
> 
> --
> Sylvain
> 
>> 
>> Is that true, and if so, is it the intended behavior? If so, one would 
>> need 5-6 times the disc space, given that the compactions that need to run 
>> after the sstable rebuild also need temporary disc space.
>> 
>> Cheers,
>> Daniel
