Also, for percentile-based speculative retry, how long a time window is used
to calculate the percentile?
If it is only a few seconds, then the percentile (and hence the retry
threshold) will rise very quickly when server performance degrades.
But if it spans a few minutes (or is configurable), then the percentile will
not shoot up in proportion to the server's degrading health, and speculative
retry might remain very useful.
Can someone share how long that window is?
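
To illustrate why I am asking: with a sliding window of recent read
latencies, the percentile tracks a degrading node only as fast as the window
turns over. A rough sketch of that tradeoff in Java (this is only an
illustration, not Cassandra's actual implementation; the window size is the
knob in question):

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.Deque;

    // Illustration only: p99 over a fixed-size window of recent latencies.
    // The smaller maxSamples is, the faster p99 follows a degrading node,
    // and the less useful a percentile-based retry threshold becomes once
    // that node is already slow.
    class SlidingWindowPercentile {
        private final Deque<Long> window = new ArrayDeque<>();
        private final int maxSamples;

        SlidingWindowPercentile(int maxSamples) {
            this.maxSamples = maxSamples;
        }

        void record(long latencyMicros) {
            window.addLast(latencyMicros);
            if (window.size() > maxSamples) {
                window.removeFirst(); // drop the oldest sample
            }
        }

        long p99() {
            if (window.isEmpty()) {
                return 0;
            }
            long[] sorted = window.stream().mapToLong(Long::longValue).toArray();
            Arrays.sort(sorted);
            int idx = (int) Math.ceil(0.99 * sorted.length) - 1;
            return sorted[Math.max(idx, 0)];
        }
    }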

On Sun, Oct 17, 2021 at 1:59 PM S G <sg.online.em...@gmail.com> wrote:

>
> "The harder thing to solve is a bad coordinator node slowing down all
> reads coordinated by that node"
> I think this is the root of the problem and since all nodes act as
> coordinator nodes, so it guaranteed that if any 1 node slows down (High GC,
> Segment Merging etc), it will slow down 1/N queries in the cluster (N =
> ring size).
>
> Speculative retry (non-percentile based) seems like a good option if it
> also mandates selecting a different server for the retry.
>
> Is any kind of speculative retry turned on by default?
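
(For anyone wanting to check what their own tables use: the per-table value
should be visible in the schema tables. A small sketch with the DataStax
Java driver 4.x, where the keyspace and table names are placeholders:)

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;

    public class ShowSpeculativeRetry {
        public static void main(String[] args) {
            // Uses the driver's default contact point; adjust for your cluster.
            // "my_ks" / "my_table" are placeholders.
            try (CqlSession session = CqlSession.builder().build()) {
                Row row = session.execute(
                    "SELECT speculative_retry FROM system_schema.tables "
                    + "WHERE keyspace_name = 'my_ks' AND table_name = 'my_table'").one();
                System.out.println("speculative_retry = "
                    + (row == null ? "<table not found>" : row.getString("speculative_retry")));
            }
        }
    }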
>
>
>
> On Wed, Oct 13, 2021 at 2:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Some random notes, not necessarily going to help you, but:
>> - You probably have vnodes enabled, which means one bad node is PROBABLY a
>> replica of almost every other node, so the fanout here is worse than it
>> should be, and
>> - You probably have speculative retry on the table set to a percentile.
>> As the host gets slow, the percentiles change, and speculative retry stops
>> being useful, so you end up timing out queries.
>>
>> If you change speculative retry to use the MIN(Xms, p99) syntax, with X
>> set based on your real workload, you can likely force it to speculate
>> sooner when that one host gets sick.
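
(My understanding is that this is a per-table setting, so the change would
look roughly like the statement below; "my_ks"/"my_table" and the 50ms bound
are placeholders, and the exact spelling of the percentile token may vary by
Cassandra version:)

    import com.datastax.oss.driver.api.core.CqlSession;

    public class SetSpeculativeRetry {
        public static void main(String[] args) {
            // Pick the fixed bound (here 50ms, a placeholder) from your own
            // healthy-state latencies so speculation starts as soon as a
            // replica is clearly slower than normal.
            try (CqlSession session = CqlSession.builder().build()) {
                session.execute("ALTER TABLE my_ks.my_table "
                    + "WITH speculative_retry = 'MIN(99PERCENTILE,50ms)'");
            }
        }
    }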
>>
>> The harder thing to solve is a bad coordinator node slowing down all
>> reads coordinated by that node. Retrying at the client level to work
>> around that tends to be effective.
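
(On the client-level retry: with the DataStax Java driver 4.x this could be
done either via its built-in speculative-execution settings or with a simple
wrapper like the sketch below, which relies on the driver handing each new
attempt a fresh query plan, so a retry is very likely to be coordinated by a
different node. The 200ms per-attempt timeout is an assumption to tune:)

    import java.time.Duration;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DriverException;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class RetryOnSlowCoordinator {
        // Re-runs an idempotent read up to maxAttempts times (assumes
        // maxAttempts >= 1). Each attempt is a fresh execution, so the
        // driver's load balancer will usually route it through a different
        // coordinator than the slow one.
        static ResultSet readWithRetry(CqlSession session, String cql, int maxAttempts) {
            SimpleStatement stmt = SimpleStatement.newInstance(cql)
                    .setTimeout(Duration.ofMillis(200)) // aggressive per-attempt timeout (placeholder)
                    .setIdempotent(true);               // only retry idempotent queries
            DriverException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return session.execute(stmt);
                } catch (DriverException e) {
                    last = e; // per-attempt timeout or error; fall through and retry
                }
            }
            throw last;
        }
    }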
>>
>>
>>
>> On Wed, Oct 13, 2021 at 2:22 PM S G <sg.online.em...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We have frequently seen that a single slow node can affect the latencies
>>> of the entire cluster (especially for queries where that node is acting
>>> as the coordinator).
>>>
>>>
>>> Is there any suggestion to avoid this behavior?
>>>
>>> For example, something on the client side to avoid querying the bad node,
>>> or something on the bad node itself that redirects its queries to other,
>>> healthy coordinators?
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
