> I'm a bit lost: I tried a repair yesterday with only one CF and that didn't 
> really work the way I expected, but I thought that was a bug which only 
> affected that special case.
>
> So I tried again for all CFs.
>
> I started with a nicely compacted machine with around 320GB of load. Total 
> disc space on this node was 1.1TB.

Did you do repairs simultaneously on all nodes?

I have seen very significant disk space increases under some
circumstances. While I haven't filed a ticket about it because there
was never time to confirm, I believe two things were at play:

(1) Nodes were sufficiently out of sync, in a sufficiently spread-out
fashion, that the granularity of the Merkle tree (IIRC, and if I read
the code correctly, it divides the ring into up to 2^15 segments but no
more) became ineffective, so repair effectively had to transfer all the
data. At first I thought there was an outright bug, but after looking
at the code I suspected it was just the Merkle tree granularity.
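To make (1) concrete, here is a minimal sketch of why a coarse leaf is
all-or-nothing (my own illustration with hypothetical types, not
Cassandra's actual classes): a single mismatched row anywhere in a
leaf's token range flips that leaf's hash, and the comparison then
marks the whole range for streaming.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LeafComparison {
    /** A token range covered by one Merkle tree leaf (hypothetical type). */
    record Range(long left, long right) {}

    /** Returns the ranges whose leaf hashes differ and therefore get streamed in full. */
    static List<Range> rangesToStream(byte[][] localLeafHashes,
                                      byte[][] remoteLeafHashes,
                                      Range[] leafRanges) {
        List<Range> outOfSync = new ArrayList<>();
        for (int i = 0; i < leafRanges.length; i++) {
            // One mismatched row anywhere in this range is enough to flip the
            // hash, so the entire range gets marked, not just the row.
            if (!Arrays.equals(localLeafHashes[i], remoteLeafHashes[i])) {
                outOfSync.add(leafRanges[i]);
            }
        }
        return outOfSync;
    }
}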

(2) I suspected at the time that a contributing factor was that as one
repair causes a node to temporarily carry significantly more live
sstables until they are compacted away, another repair on another node
may start its validation compaction and streaming against that data,
leading to disk space bloat essentially being "contagious": a third
node streaming from the temporarily bloated node will receive even
more data from it than it normally would.

We're making sure to only run one repair at a time between any hosts
that are neighbors of each other (meaning that at RF=3, that's 1
concurrent repair per 6 nodes in the cluster).
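The rule we use amounts to something like the following sketch (my own
illustration, not anything in Cassandra): a repair of a node touches
that node plus its RF-1 ring neighbors on each side, and two repairs
are allowed to overlap in time only if those footprints are disjoint.
At RF=3 each footprint is 5 nodes, so spacing repairs at least 6 ring
positions apart keeps them disjoint.

import java.util.HashSet;
import java.util.Set;

class RepairScheduling {
    /** Nodes touched by a repair of `node`: itself plus RF-1 ring neighbors on each side. */
    static Set<Integer> footprint(int node, int rf, int clusterSize) {
        Set<Integer> nodes = new HashSet<>();
        for (int offset = -(rf - 1); offset <= rf - 1; offset++) {
            nodes.add(Math.floorMod(node + offset, clusterSize));
        }
        return nodes;
    }

    /** Two repairs may run concurrently only if their footprints share no node. */
    static boolean canRunConcurrently(int a, int b, int rf, int clusterSize) {
        Set<Integer> overlap = footprint(a, rf, clusterSize);
        overlap.retainAll(footprint(b, rf, clusterSize));
        return overlap.isEmpty();
    }
}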

I'd be interested in hearing from anyone who can confirm or deny
whether my understanding of (1) in particular is correct. To connect
it to reality: a 20 GB CF divided into 2^15 segments implies each
segment is more than 600 KB in size. For CFs with tens or hundreds of
millions of small rows and a fairly random (with respect to the
partitioner) update pattern, it's not very difficult to end up in a
situation where most of those ~600 KB chunks contain out-of-sync data,
particularly in a situation with lots of dropped messages.
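As a back-of-the-envelope check of that claim (my own numbers, not
measurements; the 100,000 out-of-sync rows are an arbitrary assumption
for illustration):

class MerkleGranularity {
    public static void main(String[] args) {
        long cfSizeBytes = 20L * 1024 * 1024 * 1024; // 20 GB column family
        long leaves = 1L << 15;                      // the maxsize passed by the Validator
        System.out.printf("bytes per leaf: %d (~%d KB)%n",
                cfSizeBytes / leaves, cfSizeBytes / leaves / 1024);

        // Expected fraction of leaves containing at least one out-of-sync row,
        // assuming out-of-sync rows land uniformly at random across leaves.
        long outOfSyncRows = 100_000;
        double dirtyFraction = 1.0 - Math.pow(1.0 - 1.0 / leaves, outOfSyncRows);
        System.out.printf("expected dirty leaves: %.1f%%%n", dirtyFraction * 100);
    }
}

With those assumptions each leaf covers ~640 KB and roughly 95% of the
2^15 leaves end up dirty, at which point the tree can barely narrow
anything down and repair approaches a full transfer.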

I'm getting the 2^15 from AntiEntropyService.Validator.Validator(),
which passes a maxsize of 2^15 to the MerkleTree constructor.

-- 
/ Peter Schuller
