"The harder thing to solve is a bad coordinator node slowing down all reads coordinated by that node" I think this is the root of the problem and since all nodes act as coordinator nodes, so it guaranteed that if any 1 node slows down (High GC, Segment Merging etc), it will slow down 1/N queries in the cluster (N = ring size).
Speculative retry seems like a good option (non-percentile based) if it also mandates selecting a different server for the retry. Is any kind of speculative retry turned on by default? (A rough sketch of both the table-level setting and a client-side speculative-execution policy is appended after the quoted thread below.)

On Wed, Oct 13, 2021 at 2:33 PM Jeff Jirsa <jji...@gmail.com> wrote:

> Some random notes, not necessarily going to help you, but:
> - You probably have vnodes enabled, which means one bad node is PROBABLY a
> replica of almost every other node, so the fanout here is worse than it
> should be, and
> - You probably have speculative retry on the table set to a percentile. As
> the host gets slow, the percentiles change, and speculative retry stops
> being useful, so you end up timing out queries
>
> If you change speculative retry to use the MIN(Xms, p99) syntax, with X
> set on your real workload, you can likely force it to speculate sooner when
> that one host gets sick.
>
> The harder thing to solve is a bad coordinator node slowing down all reads
> coordinated by that node. Retry at the client level to work around that
> tends to be effective.
>
>
> On Wed, Oct 13, 2021 at 2:22 PM S G <sg.online.em...@gmail.com> wrote:
>
>> Hello,
>>
>> We have frequently seen that a single bad node running slow can affect
>> the latencies of the entire cluster (especially for queries where the slow
>> node was acting as a coordinator).
>>
>> Is there any suggestion to avoid this behavior?
>>
>> Like something on the client side to not query that bad node, or something
>> on the bad node that redirects its queries to other healthy coordinators?
>>
>> Thanks,
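
For concreteness, here is a rough sketch of both knobs discussed above, assuming the DataStax Java driver 3.x. The keyspace/table name, contact point, the 50ms threshold, and the 200ms client-side delay are made-up illustrative values, and the exact speculative_retry spelling accepted by ALTER TABLE can vary between Cassandra versions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

public class SpeculativeRetrySketch {
    public static void main(String[] args) {
        // Client-side speculative executions: if the first coordinator has not
        // replied within 200 ms, the driver sends the same query to the next
        // host in its query plan (normally a different coordinator), up to 2
        // extra attempts. The driver only speculates on idempotent statements,
        // hence the default-idempotence flag (only safe if your queries really
        // are idempotent).
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // assumed local contact point
                .withSpeculativeExecutionPolicy(
                        new ConstantSpeculativeExecutionPolicy(
                                200,            // ms to wait before speculating (assumed value)
                                2))             // max extra executions per query
                .withQueryOptions(new QueryOptions().setDefaultIdempotence(true))
                .build();
             Session session = cluster.connect()) {

            // Server-side (replica-level) speculative retry on the table, along
            // the lines of the MIN(Xms, p99) suggestion in the quoted message.
            session.execute(
                "ALTER TABLE my_keyspace.my_table "
              + "WITH speculative_retry = 'MIN(50ms,99PERCENTILE)'");
        }
    }
}

The driver-level policy sends the speculative attempt to the next host in its query plan, so the retry normally lands on a different coordinator, which is what the "different server" requirement above is after; worth double-checking that behavior against the driver version you actually run.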