Hi Alain,

Maybe it's possible to confirm this by testing on a small cluster:
- create a cluster of 2 nodes (using https://github.com/pcmanus/ccm for example)
- create a fake wide row of a few MB (using the Python driver for example; see the sketch after this list)
- drain and stop one of the two nodes
- remove the sstables of the stopped node (to provoke inconsistencies)
- start it again
- select a small portion of the wide row (many times; use nodetool tpstats to know when a read repair has been triggered)
- nodetool flush (on the previously stopped node)
- check the size of the sstable (if a few KB, then only the selected slice was repaired, but if a few MB then the whole row was repaired)
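Here is a minimal sketch of the "fake wide row" and "slice read" steps, assuming a 2-node ccm cluster listening on 127.0.0.1/127.0.0.2 and the DataStax Python driver (pip install cassandra-driver); the keyspace, table, column names and sizes (rr_test, wide, ~5 MB) are made up for the example:

# Sketch only: names, IPs and sizes are illustrative.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

cluster = Cluster(['127.0.0.1', '127.0.0.2'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS rr_test
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS rr_test.wide (
        pk text, ck int, payload text,
        PRIMARY KEY (pk, ck)
    )
""")

# One partition ("wide row") of roughly 5 MB: 5000 clustering keys x ~1 KB payload.
insert = session.prepare(
    "INSERT INTO rr_test.wide (pk, ck, payload) VALUES ('wide', ?, ?)")
blob = 'x' * 1024
for i in range(5000):
    session.execute(insert, (i, blob))

# Later, after draining/stopping one node, wiping its sstables and restarting it:
# read only a small slice of the partition, at QUORUM so both replicas are
# contacted and a digest mismatch triggers a read repair.
select = session.prepare(
    "SELECT ck, payload FROM rr_test.wide WHERE pk = 'wide' AND ck < 10")
select.consistency_level = ConsistencyLevel.QUORUM
for _ in range(100):
    session.execute(select)

cluster.shutdown()

After the slice reads, a nodetool flush on the previously stopped node and a look at the size of the new sstable in its data directory should tell whether only the small slice (a few KB) or the whole multi-MB partition was written back, as described in the last step above.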
The wild guess was: if a read repair was triggered when reading a small portion of a wide row and if it resulted in streaming the whole wide row, it could explain a network burst. (But, on second thought, it makes more sense to only repair the small portion being read...)

2015-09-01 12:05 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Hi Fabien, thanks for your help.
>
> I did not mention it, but I did see a correlation between latency and
> read repair spikes. Though this is like going from 5 RR per second to 10
> per sec cluster wide according to OpsCenter: http://img42.com/L6gx1
>
> I do have some wide rows and this explanation looks reasonable to me,
> I mean this makes sense. Yet isn't this amount of read repair too low to
> induce such a "shitstorm" (even if it spikes x2, I got network x10)? Also,
> wide rows are present on heavily used tables (sadly...), so I should be using
> more network all the time (why only a few spikes per day, like 2 / 3 max)?
>
> How could I confirm this, without removing RR and waiting a week I mean?
> Is there a way to see the size of the data being repaired through this
> mechanism?
>
> C*heers
>
> Alain
>
> 2015-09-01 0:11 GMT+02:00 Fabien Rousseau <fabifab...@gmail.com>:
>
>> Hi Alain,
>>
>> Could it be wide rows + read repair? (Let's suppose the "read repair"
>> repairs the full row, and it may not be subject to the stream throughput limit.)
>>
>> Best Regards
>> Fabien
>>
>> 2015-08-31 15:56 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:
>>
>>> I just realised that I have no idea how this mailing list handles
>>> attached files.
>>>
>>> Please find the screenshots here --> http://img42.com/collection/y2KxS
>>>
>>> Alain
>>>
>>> 2015-08-31 15:48 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> Running a 2.0.16 C* on AWS (private VPC, 2 DCs).
>>>>
>>>> I am facing an issue on our EU DC where I have network bursts
>>>> (alongside GC and latency increases).
>>>>
>>>> My first thought was a sudden application burst, though I see no
>>>> corresponding evolution in reads / writes or even CPU.
>>>>
>>>> So I thought that this might come from the nodes themselves, as IN almost
>>>> equals OUT network. I tried lowering stream throughput on the whole DC to 1
>>>> Mbps, with ~30 nodes --> 30 Mbps --> ~4 MB/s max. My network went a lot
>>>> higher, about 30 M in both directions (see screenshots attached).
>>>>
>>>> I have tried to use iftop to see where this network is headed, but
>>>> I was not able to because the bursts are very short.
>>>>
>>>> So, questions are:
>>>>
>>>> - Has anyone experienced something similar already? If so, any clue
>>>> would be appreciated :).
>>>> - How can I know (monitor, capture) where this big amount of network is
>>>> headed to or due to?
>>>> - Am I right in trying to figure out what this network is, or should I
>>>> follow another lead?
>>>>
>>>> Notes: I also noticed that CPU does not spike, nor do R&W, but disk
>>>> reads also spike!
>>>>
>>>> C*heers,
>>>>
>>>> Alain