We are performing the repair on one node only. Other nodes receive reasonable amounts of data (~500MB). It's only the repairing node itself which 'explodes'.
I must admit that I'm a noob when it comes to anti-entropy repair. It's just strange that a cluster that is up and running without problems is doing this. But I understand that it's not supposed to do what it's doing. I just hope that I find out why soon enough.

On 23.05.2011, at 21:21, Peter Schuller <peter.schul...@infidyne.com> wrote:

>> I'm a bit lost: I tried a repair yesterday with only one CF and that didn't
>> really work the way I expected, but I thought that would be a bug which only
>> affects that special case.
>>
>> So I tried again for all CFs.
>>
>> I started with a nicely compacted machine with around 320GB of load. Total
>> disc space on this node was 1.1TB.
>
> Did you do repairs simultaneously on all nodes?
>
> I have seen very significant disk space increases under some
> circumstances. While I haven't filed a ticket about it because there
> was never time to confirm, I believe two things were at play:
>
> (1) Nodes were sufficiently out of sync, in a sufficiently spread-out
> fashion, that the granularity of the Merkle tree (IIRC, and if I read
> correctly, it divides the ring into up to 2^15 segments but no more)
> became ineffective, so that repair effectively had to transfer all the
> data. At first I thought there was an outright bug, but after looking
> at the code I suspected it was just the Merkle tree granularity.
>
> (2) I suspected at the time that a contributing factor was also that,
> as one repair may cause a node to significantly increase its live
> sstables temporarily until they are compacted, another repair on
> another node may start and begin validation compaction and streaming
> of that data - leading to the disk space bloat essentially being
> "contagious"; a third node streaming from the node that was
> temporarily bloated will receive even more data from that node than
> it normally would.
>
> We're making sure to only run one repair at a time between any hosts
> that are neighbors of each other (meaning that at RF=3, that's 1
> concurrent repair per 6 nodes in the cluster).
>
> I'd be interested in hearing anyone confirm or deny whether my
> understanding of (1) in particular is correct. To connect it to
> reality: a 20 GB CF divided into 2^15 segments implies each segment
> is > 600 kbyte in size. For CFs with tens or hundreds of millions of
> small rows and a fairly random (with respect to the partitioner)
> update pattern, it's not very difficult to end up in a situation
> where most 600 kbyte chunks contain out-of-sync data, particularly
> in a situation with lots of dropped messages.
>
> I'm getting the 2^15 from AntiEntropyService.Validator.Validator(),
> which passes a maxsize of 2^15 to the MerkleTree constructor.
>
> --
> / Peter Schuller
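
To make the granularity point from (1) concrete, here is a rough back-of-the-envelope sketch in Java. The 20 GB CF size and the 2^15 segment cap come from the mail above; the one million out-of-sync rows is a made-up number purely for illustration:

// Rough estimate of how coarse the Merkle tree gets for a large CF,
// assuming the 2^15 leaf cap mentioned above. The divergent-row count
// is a hypothetical illustration, not a measurement.
public class MerkleGranularitySketch {
    public static void main(String[] args) {
        long cfSizeBytes = 20L * 1024 * 1024 * 1024; // ~20 GB column family
        int segments = 1 << 15;                      // tree capped at 2^15 ranges

        long bytesPerSegment = cfSizeBytes / segments;
        System.out.printf("Per-segment size: ~%d kbyte%n", bytesPerSegment / 1024);

        // If updates land (pseudo-)randomly across the ring, the expected
        // fraction of segments containing at least one out-of-sync row is
        // roughly 1 - (1 - 1/S)^N for N divergent rows and S segments.
        long divergentRows = 1_000_000L; // hypothetical
        double dirtyFraction = 1.0 - Math.pow(1.0 - 1.0 / segments, divergentRows);
        System.out.printf("Fraction of segments that hash differently: %.4f%n", dirtyFraction);
        System.out.printf("Data repair would stream: ~%.1f GB of the 20 GB CF%n",
                dirtyFraction * cfSizeBytes / (1024.0 * 1024 * 1024));
    }
}

With those assumptions the dirty fraction comes out at essentially 1.0, i.e. nearly the whole CF gets streamed even though only a tiny fraction of the data actually differs.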
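
And a minimal sketch of the "one repair per 2*RF neighbors" scheduling rule described above. The 12-node ring, the 2*RF spacing, and the round-robin offsets are assumptions for illustration; this is an external scheduling policy, not anything Cassandra does for you:

import java.util.ArrayList;
import java.util.List;

// Pick ring positions whose repair sessions should not share any replicas,
// by spacing concurrent repairs 2*RF nodes apart (e.g. RF=3 -> one repair
// per 6 consecutive nodes), then rotating the offset each round.
public class RepairScheduleSketch {
    public static List<Integer> concurrentRepairSlots(int clusterSize, int rf, int offset) {
        int spacing = 2 * rf;
        List<Integer> slots = new ArrayList<>();
        for (int i = offset; i < clusterSize; i += spacing) {
            slots.add(i); // ring positions allowed to repair at the same time
        }
        return slots;
    }

    public static void main(String[] args) {
        int clusterSize = 12, rf = 3; // hypothetical 12-node ring at RF=3
        for (int round = 0; round < 2 * rf; round++) {
            System.out.println("Round " + round + ": repair nodes "
                    + concurrentRepairSlots(clusterSize, rf, round));
        }
    }
}

Each round repairs two non-neighboring nodes, and after 2*RF rounds every node has been repaired exactly once.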