"The harder thing to solve is a bad coordinator node slowing down all reads
coordinated by that node"
I think this is the root of the problem: since all nodes act as
coordinator nodes, it is guaranteed that if any one node slows down (high
GC, segment merging, etc.), it will slow down roughly 1/N of the queries
in the cluster (N = ring size).

Speculative retry seems like a good option (the non-percentile-based form)
if it also mandates selecting a different server for the retry.
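For the table-level change, I assume it would look something like this
(my_ks / my_table are placeholders, the 50 ms floor is only an example
value, and if I understand correctly the MIN()/MAX() forms need
Cassandra 4.0+):

  -- Speculate at the lower of the table's p99 and a fixed 50 ms floor,
  -- so a sick replica can't drag the threshold up along with its p99.
  ALTER TABLE my_ks.my_table
    WITH speculative_retry = 'MIN(99p,50ms)';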

Is any kind of speculative retry turned on by default?
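
(For anyone who wants to check what their tables currently have, I believe
the setting is visible in the schema tables; keyspace/table names below
are placeholders:)

  -- Show the speculative_retry option currently configured for a table.
  SELECT speculative_retry
  FROM system_schema.tables
  WHERE keyspace_name = 'my_ks' AND table_name = 'my_table';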



On Wed, Oct 13, 2021 at 2:33 PM Jeff Jirsa <jji...@gmail.com> wrote:

> Some random notes, not necessarily going to help you, but:
> - You probably have vnodes enabled, which means one bad node is PROBABLY a
> replica of almost every other node, so the fanout here is worse than it
> should be, and
> - You probably have speculative retry on the table set to a percentile. As
> the host gets slow, the percentiles change, and speculative retry stops
> being useful, so you end up timing out queries
>
> If you change speculative retry to use the MIN(Xms, p99) syntax, with X
> set based on your real workload, you can likely force it to speculate
> sooner when that one host gets sick.
>
> The harder thing to solve is a bad coordinator node slowing down all reads
> coordinated by that node. Retry at the client level to work around that
> tends to be effective.
>
>
>
> On Wed, Oct 13, 2021 at 2:22 PM S G <sg.online.em...@gmail.com> wrote:
>
>> Hello,
>>
>> We have frequently seen that a single bad node running slow can affect
>> the latencies of the entire cluster (especially for queries where the slow
>> node was acting as a coordinator).
>>
>>
>> Is there any suggestion to avoid this behavior?
>>
>> For example, something on the client side to avoid querying that bad
>> node, or something on the bad node itself that redirects its queries to
>> other healthy coordinators?
>>
>>
>> Thanks,
>>
>>
>>
