On Tue, May 24, 2011 at 9:41 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> On Tue, May 24, 2011 at 12:40 AM, Daniel Doubleday
> <daniel.double...@gmx.net> wrote:
> > We are performing the repair on one node only. Other nodes receive
> > reasonable amounts of data (~500MB). It's only the repairing node
> > itself which 'explodes'.
>
> That, for instance, is a bit weird. That the node on which the repair
> is performed gets more data is expected, since it is repaired against
> all its "neighbors" while the neighbors themselves get repaired only
> against that given node. But when the differences between two nodes A
> and B are computed, the ranges to repair are streamed both from A to B
> and from B to A. Unless A and B are widely out of sync (like A has no
> data and B has tons of it), around the same amount of data should
> transit both ways. So with RF=3, the node on which repair was started
> should get around 4 times (up to 6 times if you have a weird topology)
> as much data as any neighboring node, but that's it. Whereas, if I'm
> correct, you are reporting that the neighboring node gets ~500MB and
> the "coordinator" gets > 700GB ?!
> Honestly, I'm not sure an imprecision of the merkle tree could account
> for that behavior.
>
> Anyway, Daniel, would you be able to share the logs of the nodes (at
> least the node on which repair is started)? I'm not sure how much
> that would help, but it cannot hurt.
>
> --
> Sylvain
>
> > I must admit that I'm a noob when it comes to aes/repair. It's just
> > strange that a cluster that is up and running with no probs is doing
> > that. But I understand that it's not supposed to do what it's doing.
> > I just hope that I find out why soon enough.
> >
> > On 23.05.2011, at 21:21, Peter Schuller <peter.schul...@infidyne.com>
> > wrote:
> >
> >>> I'm a bit lost: I tried a repair yesterday with only one CF and
> >>> that didn't really work the way I expected, but I thought that
> >>> would be a bug which only affects that special case.
> >>>
> >>> So I tried again for all CFs.
> >>>
> >>> I started with a nicely compacted machine with around 320GB of
> >>> load. Total disc space on this node was 1.1TB.
> >>
> >> Did you do repairs simultaneously on all nodes?
> >>
> >> I have seen very significant disk space increases under some
> >> circumstances. While I haven't filed a ticket about it because there
> >> was never time to confirm, I believe two things were at play:
> >>
> >> (1) Nodes were sufficiently out of sync, in a sufficiently spread-out
> >> fashion, that the granularity of the merkle tree (IIRC, and if I read
> >> correctly, it divides the ring into up to 2^15 segments but no more)
> >> became ineffective, so that repair effectively had to transfer all
> >> the data. At first I thought there was an outright bug, but after
> >> looking at the code I suspected it was just the merkle tree
> >> granularity.
> >>
> >> (2) I suspected at the time that a contributing factor was also that,
> >> as one repair may cause a node to significantly increase its live
> >> sstables temporarily until they are compacted, another repair on
> >> another node may start its validation compaction and streaming of
> >> that data - leading to disk space bloat essentially being
> >> "contagious"; the third node, streaming from the node that was
> >> temporarily bloated, will receive even more data from that node than
> >> it normally would.
> >>
> >> We're making sure to only run one repair at a time between any hosts
> >> that are neighbors of each other (meaning that at RF=3, that's 1
> >> concurrent repair per 6 nodes in the cluster).
> >>
> >> I'd be interested in hearing anyone confirm or deny whether my
> >> understanding of (1) in particular is correct. To connect it to
> >> reality: a 20 GB CF divided into 2^15 segments implies each segment
> >> is >600 kbyte in size. For CF:s with tens or hundreds of millions of
> >> small rows and a fairly random (with respect to the partitioner)
> >> update pattern, it's not very difficult to end up in a situation
> >> where most 600 kbyte chunks contain out-of-sync data. Particularly
> >> in a situation with lots of dropped messages.
> >>
> >> I'm getting the 2^15 from AntiEntropyService.Validator.Validator(),
> >> which passes a maxsize of 2^15 to the MerkleTree constructor.
> >>
> >> --
> >> / Peter Schuller

I never run repair for just this reason. It is very intensive and it
produces a lot of data. I am still on 0.6.X, so this would be better for
me when I upgrade. If I had to take a wild stab at it, I would guess that
"guys like us" with 300 GB of data and possibly tiny rows run a greater
chance of something not being in sync than those with, for example, 20 GB
per node. If you are doing a high volume of inserts and disabled HH (or
even set it to only store hints for an hour), and you had a three-hour
outage, some nodes are going to be out of sync.
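
For what it's worth, here is a quick back-of-the-envelope sketch of
Peter's granularity point. The 2^15 segment cap and the 20 GB CF size are
from his mail above; the average row size and the number of out-of-sync
rows are numbers I made up purely to illustrate:

public class MerkleGranularity {
    public static void main(String[] args) {
        long cfBytes = 20L * 1024 * 1024 * 1024;   // 20 GB column family (from Peter's mail)
        long segments = 1L << 15;                  // max merkle tree segments (from Peter's mail)
        long bytesPerSegment = cfBytes / segments; // ~640 KB per segment
        System.out.printf("segment size: ~%d KB%n", bytesPerSegment / 1024);

        long avgRowBytes = 200;                    // assumption: tiny rows
        long rowsPerSegment = bytesPerSegment / avgRowBytes;
        System.out.printf("rows per segment: ~%d%n", rowsPerSegment);

        // If out-of-sync rows land randomly across the ring (random
        // partitioner), the chance that a given segment hashes clean
        // collapses once the out-of-sync row count approaches the
        // number of segments.
        long outOfSyncRows = 500_000;              // assumption: e.g. dropped writes
        double pSegmentClean = Math.pow(1.0 - 1.0 / segments, outOfSyncRows);
        System.out.printf("fraction of segments with no difference: %.4f%n", pSegmentClean);
        System.out.printf("=> repair would re-stream roughly %.1f%% of the CF%n",
                          (1.0 - pSegmentClean) * 100);
    }
}

With numbers like those, essentially no segment hashes clean, so repair
ends up streaming nearly the whole CF even though only a small fraction
of the rows actually differ - which matches the kind of blow-up being
described here.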