Internode speculative retry is on by default, with a p99 threshold.
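
For reference, a minimal sketch of how that table-level threshold could be inspected and changed, assuming Cassandra 4.x (where the MIN()/MAX() forms Jeff mentions below are accepted), the DataStax Java driver 4.x, and hypothetical names my_ks / my_table:

    import com.datastax.oss.driver.api.core.CqlSession;

    public class TuneSpeculativeRetry {
        public static void main(String[] args) {
            // Assumes a node reachable on localhost; my_ks / my_table are placeholders.
            try (CqlSession session = CqlSession.builder().build()) {
                // Inspect the current setting (the default is the 99th percentile).
                session.execute(
                    "SELECT speculative_retry FROM system_schema.tables "
                  + "WHERE keyspace_name = 'my_ks' AND table_name = 'my_table'")
                    .forEach(row -> System.out.println(row.getString("speculative_retry")));

                // Cap the speculation threshold so a replica read is retried after at
                // most 50ms, even when the sick node has inflated the observed p99.
                session.execute(
                    "ALTER TABLE my_ks.my_table "
                  + "WITH speculative_retry = 'MIN(99PERCENTILE,50ms)'");
            }
        }
    }

The 50ms cap is only an illustration; as Jeff notes below, X should come from your real workload.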

Client-side retry behavior varies by driver / client.
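
As one illustration (each driver exposes this differently), here is a rough sketch with the DataStax Java driver 4.x: a constant speculative-execution policy races a slow coordinator against the next node in the query plan. The delay, max-executions value, and query are placeholders, and the driver only speculates on statements marked idempotent:

    import java.time.Duration;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
    import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class ClientSideSpeculation {
        public static void main(String[] args) {
            // If no response arrives within the delay, the driver sends the same
            // request to the next node in the query plan (up to max-executions).
            DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withString(DefaultDriverOption.SPECULATIVE_EXECUTION_POLICY_CLASS,
                            "ConstantSpeculativeExecutionPolicy")
                .withInt(DefaultDriverOption.SPECULATIVE_EXECUTION_MAX, 2)
                .withDuration(DefaultDriverOption.SPECULATIVE_EXECUTION_DELAY,
                              Duration.ofMillis(100))
                .build();

            try (CqlSession session = CqlSession.builder().withConfigLoader(loader).build()) {
                // Speculative executions apply only to idempotent statements.
                SimpleStatement stmt = SimpleStatement
                    .newInstance("SELECT release_version FROM system.local")
                    .setIdempotent(true);
                System.out.println(session.execute(stmt).one().getString("release_version"));
            }
        }
    }

Older 3.x drivers expose the same idea through Cluster.builder().withSpeculativeExecutionPolicy(...); either way it only helps when the request is safe to send twice.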

> On Oct 17, 2021, at 1:59 PM, S G <sg.online.em...@gmail.com> wrote:
> 
> 
> 
> "The harder thing to solve is a bad coordinator node slowing down all reads 
> coordinated by that node"
> I think this is the root of the problem. Since all nodes act as coordinator 
> nodes, it is guaranteed that if any one node slows down (high GC, segment 
> merging, etc.), it will slow down 1/N of the queries in the cluster (N = ring 
> size).
> 
> Speculative retry seems like a good option (non-percentile based) if it also 
> mandates the selection of a different server for the retry.
> 
> Is any kind of speculative retry turned on by default ?
> 
> 
> 
>> On Wed, Oct 13, 2021 at 2:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>> Some random notes, not necessarily going to help you, but:
>> - You probably have vnodes enabled, which means one bad node is PROBABLY a 
>> replica of almost every other node, so the fanout here is worse than it 
>> should be, and
>> - You probably have speculative retry on the table set to a percentile. As 
>> the host gets slow, the percentiles change, and speculative retry stops being 
>> useful, so you end up timing out queries.
>> 
>> If you change speculative retry to use the MIN(Xms, p99) syntax, with X set 
>> based on your real workload, you can likely force it to speculate sooner when 
>> that one host gets sick.
>> 
>> The harder thing to solve is a bad coordinator node slowing down all reads 
>> coordinated by that node. Retrying at the client level tends to be an 
>> effective way to work around that.
>> 
>> 
>> 
>>> On Wed, Oct 13, 2021 at 2:22 PM S G <sg.online.em...@gmail.com> wrote:
>>> Hello,
>>> 
>>> We have frequently seen that a single bad node running slow can affect the 
>>> latencies of the entire cluster (especially for queries where the slow node 
>>> was acting as a coordinator).
>>> 
>>> Is there any suggestion to avoid this behavior?
>>> For example, something on the client side to avoid querying that bad node, or 
>>> something on the bad node that redirects its queries to other healthy coordinators?
>>> 
>>> Thanks,
>>> 
