Internode speculative retry is on by default, with p99 as the threshold.

Client-side retry varies by driver / client.
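
To make the percentile point in Jeff's note (quoted below) concrete, here is a rough Python sketch of why a percentile-only threshold drifts once a host gets sick, and how a MIN(Xms, p99) style cap bounds it. This is only an illustration, not how Cassandra computes it internally: the function names, the sample latencies, and the 50 ms cap are all made up for the example, and X should come from your real workload.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latencies in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def speculation_threshold(samples_ms, cap_ms=None):
    """Delay before speculating to another replica (hypothetical helper).

    With cap_ms=None this mimics a pure p99 setting; with a cap it mimics
    the MIN(Xms, p99) idea: never wait longer than cap_ms.
    """
    p99 = percentile(samples_ms, 99)
    return p99 if cap_ms is None else min(cap_ms, p99)

# Assumed sample data: 100 recent read latencies in ms.
HEALTHY_MS = [5.0] * 98 + [20.0] * 2    # normal tail
SICK_MS = [5.0] * 90 + [500.0] * 10     # one slow host drags the tail up

if __name__ == "__main__":
    for name, samples in (("healthy", HEALTHY_MS), ("sick", SICK_MS)):
        print(name,
              "p99 only:", speculation_threshold(samples),
              "capped at 50 ms:", speculation_threshold(samples, cap_ms=50.0))
```

With the pure percentile, the threshold follows the slow host upward, so speculation kicks in too late to help. With the cap, it never waits longer than X.
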
> On Oct 17, 2021, at 1:59 PM, S G <sg.online.em...@gmail.com> wrote:
>
> "The harder thing to solve is a bad coordinator node slowing down all reads
> coordinated by that node"
>
> I think this is the root of the problem. Since all nodes act as coordinator
> nodes, it is guaranteed that if any one node slows down (high GC, segment
> merging, etc.), it will slow down 1/N of the queries in the cluster (N = ring
> size).
>
> Speculative retry seems like a good option (non-percentile based) if it also
> mandates the selection of a different server in the retry.
>
> Is any kind of speculative retry turned on by default?
>
>> On Wed, Oct 13, 2021 at 2:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>> Some random notes, not necessarily going to help you, but:
>> - You probably have vnodes enabled, which means one bad node is PROBABLY a
>>   replica of almost every other node, so the fanout here is worse than it
>>   should be, and
>> - You probably have speculative retry on the table set to a percentile. As
>>   the host gets slow, the percentiles change, and speculative retry stops
>>   being useful, so you end up timing out queries
>>
>> If you change speculative retry to use the MIN(Xms, p99) syntax, with X set
>> on your real workload, you can likely force it to speculate sooner when that
>> one host gets sick.
>>
>> The harder thing to solve is a bad coordinator node slowing down all reads
>> coordinated by that node. Retry at the client level to work around that
>> tends to be effective.
>>
>>> On Wed, Oct 13, 2021 at 2:22 PM S G <sg.online.em...@gmail.com> wrote:
>>> Hello,
>>>
>>> We have frequently seen that a single bad node running slow can affect the
>>> latencies of the entire cluster (especially for queries where the slow node
>>> was acting as a coordinator).
>>>
>>> Is there any suggestion to avoid this behavior?
>>> Like something on the client side to not query that bad node or something
>>> on the bad node that redirects its query to other healthy coordinators?
>>>
>>> Thanks,