Just noticed this thread and thought I’d chime in with my experiences.

I run Solr on Kubernetes at pretty high throughput (it powers the search for 
autotrader.co.uk).

During node rollouts, which happen for a variety of reasons (Solr upgrades, 
Kubernetes upgrades, etc.), we experience those same latency spikes.

We ended up writing a custom Kubernetes controller which uses readiness gates 
(https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) 
to effectively “warm” a Solr node before it starts getting load balancer 
traffic. What happens is:

  *   Node comes up
  *   Becomes “ready” (its containers pass their probes) but doesn’t yet pass 
its readiness gate, so it gets no load balancer traffic
  *   The controller sees the pod is ready and uses Istio’s traffic mirroring 
(https://istio.io/latest/docs/tasks/traffic-management/mirroring/) to gradually 
ramp up mirrored traffic (25%, 50%, 75%) over the course of a few minutes
  *   The controller then sets the readiness gate condition to pass, which adds 
the pod to the actual load balancer (sketched below)

This pattern allows us to sufficiently warm the Solr caches and do rolling 
restarts of Solr even under full load, without any consumer latency impact, as 
the warming is done with mirrored traffic before consumer traffic is sent.
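
For illustration, here is a rough sketch of the two moving parts in config 
form. The condition type, hostnames, subset names and image tag are 
hypothetical placeholders rather than our actual setup, and the controller 
logic itself isn't shown:

```yaml
# Pod spec: the readiness gate keeps the pod out of the Service's
# endpoints until a controller sets the custom condition to True.
apiVersion: v1
kind: Pod
metadata:
  name: solr-0
spec:
  readinessGates:
    - conditionType: "example.com/solr-warmed"   # hypothetical condition type
  containers:
    - name: solr
      image: solr:9.1.1
---
# Istio VirtualService: mirror a slice of live traffic at the warming
# pod; the controller raises mirrorPercentage (25 -> 50 -> 75) over a
# few minutes, then flips the pod condition above to True.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: solr
spec:
  hosts:
    - solr.search.svc.cluster.local
  http:
    - route:
        - destination:
            host: solr.search.svc.cluster.local
            subset: live       # pods already serving consumer traffic
      mirror:
        host: solr.search.svc.cluster.local
        subset: warming        # the restarted pod (DestinationRule subsets not shown)
      mirrorPercentage:
        value: 25.0
```

The controller itself just watches pods, adjusts the mirror percentage, and 
finally patches the pod's status subresource to set the custom condition to 
True, at which point Kubernetes marks the pod ready and it joins the load 
balancer.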

Another pattern you could consider is Envoy's `slow start` 
(https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/slow_start), 
which gradually ramps up the traffic sent to newly started endpoints. However, 
this uses actual consumer traffic, so whilst it would help avoid overwhelming a 
Solr node when it first starts, some consumers will still experience slow query 
performance initially.
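
As a rough, hedged sketch (the field names follow the Envoy cluster API, but 
the cluster name, window and address here are made-up values):

```yaml
clusters:
  - name: solr_upstream            # hypothetical cluster name
    type: STRICT_DNS
    connect_timeout: 5s
    lb_policy: ROUND_ROBIN
    round_robin_lb_config:
      slow_start_config:
        slow_start_window: 120s    # ramp each new endpoint up over 2 minutes
        aggression:
          default_value: 1.0       # 1.0 = linear ramp
          runtime_key: solr.slow_start.aggression
        min_weight_percent:
          value: 10.0              # floor so a new endpoint still gets some traffic
    load_assignment:
      cluster_name: solr_upstream
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: solr.search.svc.cluster.local
                    port_value: 8983
```

The same `slow_start_config` block is also supported under 
`least_request_lb_config` if you use least-request balancing.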

Both of these solutions are Kubernetes-centric, but the general principle could 
be applied anywhere.

Hope this helps you.

From: rajani m <rajinima...@gmail.com>
Date: Wednesday, 13 September 2023 at 19:15
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: Restart on a node triggers restart like impact on all the other 
nodes in cluster

Hi,

  Thank you for looking into this one.

I tailed the logs and figured out that the new node is the culprit: it fails
search queries for some time upon startup. The errors are shown below. The
query failures on that node's shard(s) cause the overall aggregated search
query to take about 7-10 seconds, producing the latency spikes; as a
consequence, the load balancer sends fewer requests overall, so the other
nodes see a dip in search QPS and thereby a dip in their resource usage.

A node goes down and recovers fully; ZooKeeper immediately starts sending
search queries to it, and it fails queries for some time (2 minutes) before it
actually returns 200 responses. What is the cause of the first error, and is
there any way to avoid it?

Is there a way to configure ZooKeeper to not send traffic to that node's
shards for a few minutes, and to warm it up with some queries before it starts
receiving query traffic?


Error -

2023-09-13 17:50:34.277 ERROR (qtp517787604-178) [c:v9-web s:shard34 r:core_node772 x:v9-web_shard34_replica_n771] o.a.s.h.RequestHandlerBase java.util.ConcurrentModificationException => java.util.ConcurrentModificationException
    at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1221)
java.util.ConcurrentModificationException: null
    at java.util.HashMap.computeIfAbsent(HashMap.java:1221) ~[?:?]
    at org.apache.solr.schema.IndexSchema.getPayloadDecoder(IndexSchema.java:2118) ~[?:?]
    at org.apache.solr.search.ValueSourceParser$66.parse(ValueSourceParser.java:899) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
    at org.apache.solr.search.ValueSourceParser$17.parse(ValueSourceParser.java:349) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94) ~[?:?]
    at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
    at org.apache.solr.search.ValueSourceParser$16.parse(ValueSourceParser.java:338) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94) ~[?:?]
    at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
    at org.apache.solr.search.ValueSourceParser$17.parse(ValueSourceParser.java:349) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
    at org.apache.solr.search.ValueSourceParser$16.parse(ValueSourceParser.java:338) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:272) ~[?:?]
    at org.apache.solr.search.ValueSourceParser$DoubleParser.parse(ValueSourceParser.java:1646) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
    at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94) ~[?:?]
    at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
    at org.apache.solr.search.ExtendedDismaxQParser.getMultiplicativeBoosts(ExtendedDismaxQParser.java:532) ~[?:?]

and

query params =>
org.apache.lucene.index.ExitableDirectoryReader$ExitingReaderException: The request took too long to iterate over point values. Timeout: timeoutAt: 130698068079 (System.nanoTime(): 130791560015), PointValues=org.apache.lucene.util.bkd.BKDReader@2b77d603
    at org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.checkAndThrow(ExitableDirectoryReader.java:482)
org.apache.lucene.index.ExitableDirectoryReader$ExitingReaderException: The request took too long to iterate over point values. Timeout: timeoutAt: 130698068079 (System.nanoTime(): 130791560015), PointValues=org.apache.lucene.util.bkd.BKDReader@2b77d603
    at org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.checkAndThrow(ExitableDirectoryReader.java:482) ~[?:?]
    at org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.<init>(ExitableDirectoryReader.java:471) ~[?:?]
    at org.apache.lucene.index.ExitableDirectoryReader$ExitableFilterAtomicReader.getPointValues(ExitableDirectoryReader.java:85) ~[?:?]
    at org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator.<init>(NumericComparator.java:107) ~[?:?]
    at org.apache.lucene.search.comparators.LongComparator$LongLeafComparator.<init>(LongComparator.java:65) ~[?:?]
    at org.apache.lucene.search.comparators.LongComparator.getLeafComparator(LongComparator.java:58) ~[?:?]
    at org.apache.solr.handler.component.QueryComponent.doFieldSortValues(QueryComponent.java:523) ~[?:?]
    at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1661) ~[?:?]
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:421) ~[?:?]
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:420) ~[?:?]
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2865) ~[?:?]
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:887) ~[?:?]
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:606) ~[?:?]
    at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250) ~[?:?]
    at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:218) ~[?:?]
    at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257) ~[?:?]
    at org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227) ~[?:?]
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213) ~[?:?]
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) ~[?:?]
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600) ~[jetty-security-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]

Thanks,
Rajani



On Wed, Sep 13, 2023 at 1:59 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 9/12/23 18:28, rajani m wrote:
> >    Solr 9.1.1 version: upon restarting Solr on any node in the cluster, a
> > unique event is triggered across all the *other* nodes in the cluster that
> > has an impact similar to restarting Solr on all the other nodes in the
> > cluster. There is a dip in the CPU usage, all the caches are emptied and
> > warmed up, and there are disk reads/writes on all the other nodes.
>
> How much RAM is in each node?  How much is given to the Java heap?  Are
> you running more than one Solr instance on each node?  How much disk
> space do the indexes on each node consume?
>
> What are the counts of:
>
> * Nodes
> * Collections
> * Shards per collection
> * Replica count per shard
> * Documents per shard
>
> There is sometimes some confusion about replica count.  I've seen people
> say they have "one shard and one replica" when the right way to state it
> is that the replica count is two.
>
> If the counts above are large (meaning that you have a LOT of cores)
> then restarting a node can be very disruptive to the cloud as a whole.
> See this issue from several years ago where I explored this:
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> The issue has been marked as resolved in version 6.3.0, but no code was
> modified, and as far as I know, the problem still exists.
>
> It's worth noting that in my tests for that issue, the collections were
> empty.  For collections that actually have data, the problem will be worse.
>
> If there are a lot of adds/updates/deletes happening, then the delta
> between the replicas might exceed the threshold for transaction log
> recovery.  Solr may be doing a full replication to the cores on the
> restarted node.  But I would expect that to only affect the shard
> leaders, which are the source for the replicated data.
>
> Thanks,
> Shawn
>
>

