We are performing the repair on one node only. Other nodes receive reasonable 
amounts of data (~500MB).  It's only the repairing node itself which 
'explodes'. 

I must admit that I'm a noob when it comes to AES/repair. It's just strange 
that a cluster that is up and running with no problems is doing that. But I 
understand that it's not supposed to do what it's doing. I just hope that I 
find out why soon enough. 



On 23.05.2011, at 21:21, Peter Schuller <peter.schul...@infidyne.com> wrote:

>> I'm a bit lost: I tried a repair yesterday with only one CF and that didn't 
>> really work the way I expected but I thought that would be a bug which only 
>> affects that special case.
>> 
>> So I tried again for all CFs.
>> 
>> I started with a nicely compacted machine with around 320GB of load. Total 
>> disc space on this node was 1.1TB.
> 
> Did you do repairs simultaneously on all nodes?
> 
> I have seen very significant disk space increases under some
> circumstances. While I haven't filed a ticket about it because there
> was never time to confirm, I believe two things were at play:
> 
> (1) nodes were sufficiently out of sync, in a sufficiently spread-out
> fashion, that the granularity of the merkle tree (IIRC, and if I read
> correctly, it divides the ring into up to 2^15 segments but no more)
> became ineffective, so that repair effectively had to transfer all the
> data. At first I thought there was an outright bug, but after looking
> at the code I suspected it was just the merkle tree granularity.
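> 
> To make the granularity point concrete, here is a rough back-of-the-envelope
> sketch (plain Java, not the actual repair code; the 2^15 cap is the one
> mentioned above, and the ~320GB load is the figure from earlier in this
> thread):
> 
>   // With the tree capped at 2^15 leaves, a single differing row makes a
>   // leaf's hashes disagree, so repair streams that entire leaf range,
>   // not just the out-of-sync row.
>   public class MerkleGranularity {
>       public static void main(String[] args) {
>           long cfSizeBytes = 320L * 1024 * 1024 * 1024; // ~320GB of load
>           int maxLeaves = 1 << 15;                      // cap on merkle tree segments
>           long bytesPerLeaf = cfSizeBytes / maxLeaves;  // data covered by one leaf
>           System.out.printf("data per leaf segment: ~%d MB%n", bytesPerLeaf >> 20);
>           // prints ~10 MB: one stale row anywhere in a leaf means the whole
>           // ~10 MB range gets re-streamed.
>       }
>   }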
> 
> (2) I suspected at the time that a contributing factor was also that,
> as one repair might cause a node to significantly increase its live
> sstables temporarily until they are compacted, another repair on
> another node may start its validation compaction and streaming
> of that data - leading to disk space bloat essentially being
> "contagious"; the third node, streaming from the node that was
> temporarily bloated, will receive even more data from that node than
> it normally would.
> 
> We're making sure to only run one repair at a time between any hosts
> that are neighbors of each other (meaning that at RF=3, that's 1
> concurrent repair per 6 nodes in the cluster).
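> 
> As a sketch of what that spacing looks like (hypothetical helper, not a
> Cassandra or nodetool API; it assumes the node count is a multiple of the
> spacing, otherwise the wrap-around pair needs its own round):
> 
>   import java.util.ArrayList;
>   import java.util.List;
> 
>   public class RepairSchedule {
>       // With RF replicas, a repair on node i also touches the RF-1 nodes on
>       // either side of it, so concurrent repairs are spaced 2*RF positions
>       // apart on the ring.
>       public static List<List<Integer>> rounds(int nodeCount, int rf) {
>           int spacing = 2 * rf; // RF=3 -> one concurrent repair per 6 nodes
>           List<List<Integer>> rounds = new ArrayList<>();
>           for (int offset = 0; offset < spacing; offset++) {
>               List<Integer> round = new ArrayList<>();
>               for (int node = offset; node < nodeCount; node += spacing) {
>                   round.add(node); // these can run repair concurrently
>               }
>               rounds.add(round);
>           }
>           return rounds;
>       }
> 
>       public static void main(String[] args) {
>           System.out.println(rounds(12, 3)); // [[0, 6], [1, 7], ..., [5, 11]]
>       }
>   }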
> 
> I'd be interested in hearing anyone confirm or deny whether my
> understanding of (1) in particular is correct. To connect it to
> reality: a 20 GB CF divided into 2^15 segments implies each segment
> is > 600 kbyte in size. For CFs with tens or hundreds of millions of
> small rows and a fairly random (with respect to the partitioner)
> update pattern, it's not very difficult to end up in a situation
> where most 600 kbyte chunks contain out-of-sync data. Particularly
> in a situation with lots of dropped messages.
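> 
> A quick sanity check of that arithmetic (plain Java, just the math; the
> stale-row counts are made-up examples, and it assumes out-of-sync rows land
> on segments uniformly at random):
> 
>   public class DirtySegments {
>       public static void main(String[] args) {
>           double segments = Math.pow(2, 15);       // 32768 merkle tree leaves
>           System.out.printf("segment size for a 20 GB CF: ~%.0f kbyte%n",
>                             20.0 * 1024 * 1024 / segments);   // ~640 kbyte
>           // Expected fraction of clean segments is (1 - 1/2^15)^d
>           // for d out-of-sync rows spread randomly over the ring.
>           for (long d : new long[]{10_000, 100_000, 1_000_000}) {
>               double clean = Math.pow(1.0 - 1.0 / segments, d);
>               System.out.printf("%,d stale rows -> ~%.0f%% of segments dirty%n",
>                                 d, 100 * (1 - clean));
>           }
>           // Already ~100k stale rows dirty ~95% of the 2^15 segments, so
>           // repair ends up streaming nearly the whole CF.
>       }
>   }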
> 
> I'm getting the 2^15 from AntiEntropyService.Validator.Validator(),
> which passes a maxsize of 2^15 to the MerkleTree constructor.
> 
> -- 
> / Peter Schuller
