Hi Alain,

Maybe it's possible to confirm this by testing on a small cluster:
- create a cluster of 2 nodes (using https://github.com/pcmanus/ccm for example)
- create a fake wide row of a few MB (using the Python driver for example; see the sketch after this list)
- drain and stop one of the two nodes
- remove the sstables of the stopped node (to provoke inconsistencies)
- start it again
- select a small portion of the wide row (many times; use nodetool tpstats to know when a read repair has been triggered)
- nodetool flush (on the previously stopped node)
- check the size of the sstable (if a few KB, then only the selected slice was repaired, but if a few MB then the whole row was repaired)
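Here is a minimal sketch of the "fake wide row" and "slice read" steps, assuming a 2-node ccm cluster listening on 127.0.0.1/127.0.0.2 and the DataStax Python driver (pip install cassandra-driver); the keyspace, table, column names and sizes (rr_test, wide, ~5 MB) are made up for the example:

# Sketch only: names, IPs and sizes are illustrative.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

cluster = Cluster(['127.0.0.1', '127.0.0.2'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS rr_test
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS rr_test.wide (
        pk text, ck int, payload text,
        PRIMARY KEY (pk, ck)
    )
""")

# One partition ("wide row") of roughly 5 MB: 5000 clustering keys x ~1 KB payload.
insert = session.prepare(
    "INSERT INTO rr_test.wide (pk, ck, payload) VALUES ('wide', ?, ?)")
blob = 'x' * 1024
for i in range(5000):
    session.execute(insert, (i, blob))

# Later, after draining/stopping one node, wiping its sstables and restarting it:
# read only a small slice of the partition, at QUORUM so both replicas are
# contacted and a digest mismatch triggers a read repair.
select = session.prepare(
    "SELECT ck, payload FROM rr_test.wide WHERE pk = 'wide' AND ck < 10")
select.consistency_level = ConsistencyLevel.QUORUM
for _ in range(100):
    session.execute(select)

cluster.shutdown()

After the slice reads, a nodetool flush on the previously stopped node and a look at the size of the new sstable in its data directory should tell whether only the small slice (a few KB) or the whole multi-MB partition was written back, as described in the last step above.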
The wild guess was: if a read repair was triggered when reading a small portion of a wide row and if it resulted in streaming the whole wide row, it could explain a network burst. (But, on second thought, it makes more sense to only repair the small portion being read...)

2015-09-01 12:05 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Hi Fabien, thanks for your help.
>
> I did not mention it, but I did see a correlation between latency and
> read repair spikes. Though this is like going from 5 RR per second to 10
> per sec cluster wide according to OpsCenter: http://img42.com/L6gx1
>
> I do have some wide rows and this explanation looks reasonable to me,
> I mean this makes sense. Yet isn't this amount of read repair too low to
> induce such a "shitstorm" (even if it spikes x2, I got network x10)? Also,
> wide rows are present on heavily used tables (sadly...), so I should be using
> more network all the time (why only a few spikes per day, like 2 / 3 max)?
>
> How could I confirm this, without removing RR and waiting a week I mean?
> Is there a way to see the size of the data being repaired through this
> mechanism?
>
> C*heers
>
> Alain
>
> 2015-09-01 0:11 GMT+02:00 Fabien Rousseau <fabifab...@gmail.com>:
>
>> Hi Alain,
>>
>> Could it be wide rows + read repair? (Let's suppose the "read repair"
>> repairs the full row, and it may not be subject to the stream throughput limit.)
>>
>> Best Regards
>> Fabien
>>
>> 2015-08-31 15:56 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:
>>
>>> I just realised that I have no idea how this mailing list handles
>>> attached files.
>>>
>>> Please find the screenshots here --> http://img42.com/collection/y2KxS
>>>
>>> Alain
>>>
>>> 2015-08-31 15:48 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> Running a 2.0.16 C* on AWS (private VPC, 2 DCs).
>>>>
>>>> I am facing an issue on our EU DC where I have network bursts
>>>> (alongside GC and latency increases).
>>>>
>>>> My first thought was a sudden application burst, though I see no
>>>> corresponding evolution in reads / writes or even CPU.
>>>>
>>>> So I thought that this might come from the nodes themselves, as IN almost
>>>> equals OUT network. I tried lowering stream throughput on the whole DC to 1
>>>> Mbps, with ~30 nodes --> 30 Mbps --> ~4 MB/s max. My network went a lot
>>>> higher, about 30 M in both directions (see screenshots attached).
>>>>
>>>> I have tried to use iftop to see where this network is headed, but
>>>> I was not able to because the bursts are very short.
>>>>
>>>> So, questions are:
>>>>
>>>> - Has anyone experienced something similar already? If so, any clue
>>>> would be appreciated :).
>>>> - How can I know (monitor, capture) where this big amount of network is
>>>> headed to or due to?
>>>> - Am I right in trying to figure out what this network is, or should I
>>>> follow another lead?
>>>>
>>>> Notes: I also noticed that CPU does not spike, nor do R&W, but disk
>>>> reads also spike!
>>>>
>>>> C*heers,
>>>>
>>>> Alain