Just noticed this thread and thought I’d chime in with my experience. I run Solr on Kubernetes at fairly high throughput (it powers the search for autotrader.co.uk).
During node rollouts, which happen for a variety of reasons (Solr upgrades, Kubernetes upgrades, etc.), we experience those same latency spikes. In the end we wrote a custom Kubernetes controller which uses readiness gates (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate) to effectively “warm” a Solr node before it starts getting load balancer traffic. What happens is:

* The node comes up.
* It becomes “ready” but doesn’t yet pass the readiness gate.
* The controller sees the pod is ready and uses Istio’s traffic mirroring (https://istio.io/latest/docs/tasks/traffic-management/mirroring/) to gradually ramp up mirrored traffic (25%, 50%, 75%) over the course of a few minutes.
* The controller then sets the readiness gate condition so it passes, which adds the pod to the actual load balancer (a rough sketch of this step is below).

This pattern allows us to sufficiently warm the Solr caches and do rolling restarts of Solr even under full load, without any consumer latency impact, because the warming is done with mirrored traffic before consumer traffic is sent.
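For illustration, here is a minimal sketch of that last step, i.e. a controller flipping the readiness-gate condition on a pod via client-go. The namespace, pod name and condition type are made up, and error handling is kept to a minimum; treat it as a sketch, not our actual controller code:

// Illustrative only: how a controller might flip a custom readiness-gate
// condition once cache warming has finished. Namespace, pod and condition
// names below are hypothetical.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// markWarmed sets the readiness-gate condition to True on the pod's status.
// The condition type must match spec.readinessGates[].conditionType; once it
// is True the kubelet marks the pod Ready and it joins the Service endpoints.
func markWarmed(ctx context.Context, cs kubernetes.Interface, ns, pod, gate string) error {
	p, err := cs.CoreV1().Pods(ns).Get(ctx, pod, metav1.GetOptions{})
	if err != nil {
		return err
	}
	cond := corev1.PodCondition{
		Type:               corev1.PodConditionType(gate),
		Status:             corev1.ConditionTrue,
		Reason:             "WarmupComplete",
		LastTransitionTime: metav1.Now(),
	}
	// Update the condition in place if it already exists, otherwise append it.
	updated := false
	for i := range p.Status.Conditions {
		if p.Status.Conditions[i].Type == cond.Type {
			p.Status.Conditions[i] = cond
			updated = true
			break
		}
	}
	if !updated {
		p.Status.Conditions = append(p.Status.Conditions, cond)
	}
	_, err = cs.CoreV1().Pods(ns).UpdateStatus(ctx, p, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes the controller runs in-cluster
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Hypothetical names; a real controller would discover pods via a watch.
	if err := markWarmed(context.Background(), cs, "search", "solr-0", "example.com/solr-warmed"); err != nil {
		panic(err)
	}
	fmt.Println("readiness gate condition set; pod can now receive traffic")
}

The condition type has to match whatever you declare under spec.readinessGates in the pod spec; until the controller sets it to True, the pod never becomes Ready and so never joins the Service endpoints, even though its containers already pass their normal readiness probes.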
Another pattern you could consider is Envoy’s `slow start` (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/slow_start), which gradually ramps traffic up to new endpoints. However, it uses actual consumer traffic, so while it would help avoid overwhelming a Solr node when it first starts, some consumers would still see slow query performance initially.

Both of these solutions are Kubernetes-centric, but the general principle could be applied anywhere.

Hope this helps you.

From: rajani m <rajinima...@gmail.com>
Date: Wednesday, 13 September 2023 at 19:15
To: users@solr.apache.org <users@solr.apache.org>
Subject: Re: Restart on a node triggers restart like impact on all the other nodes in cluster

Hi,

Thank you for looking into this one. I tailed the logs and figured out that the new node is the culprit: it fails search queries for some time after startup. The errors are shown below. The query failures on that node’s shard(s) cause the overall aggregated search query to take about 7-10 seconds, which produces the latency spikes; as a consequence the load balancer sends fewer requests, so the other nodes see a dip in search qps and therefore a dip in their resource usage.

A node goes down and recovers fully, zk immediately starts sending search queries to it, and it fails queries for some time (about 2 minutes) before it actually returns 200 responses. What is the cause of the first error, and is there any way to avoid it? Is there a way to configure zk to not send traffic to that node’s shards for a few minutes and warm it up with some queries before it starts receiving queries?

Error -

2023-09-13 17:50:34.277 ERROR (qtp517787604-178) [c:v9-web s:shard34 r:core_node772 x:v9-web_shard34_replica_n771] o.a.s.h.RequestHandlerBase java.util.ConcurrentModificationException => java.util.ConcurrentModificationException
  at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1221)
java.util.ConcurrentModificationException: null
  at java.util.HashMap.computeIfAbsent(HashMap.java:1221) ~[?:?]
  at org.apache.solr.schema.IndexSchema.getPayloadDecoder(IndexSchema.java:2118) ~[?:?]
  at org.apache.solr.search.ValueSourceParser$66.parse(ValueSourceParser.java:899) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
  at org.apache.solr.search.ValueSourceParser$17.parse(ValueSourceParser.java:349) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94) ~[?:?]
  at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
  at org.apache.solr.search.ValueSourceParser$16.parse(ValueSourceParser.java:338) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94) ~[?:?]
  at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
  at org.apache.solr.search.ValueSourceParser$17.parse(ValueSourceParser.java:349) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252) ~[?:?]
  at org.apache.solr.search.ValueSourceParser$16.parse(ValueSourceParser.java:338) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:272) ~[?:?]
  at org.apache.solr.search.ValueSourceParser$DoubleParser.parse(ValueSourceParser.java:1646) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434) ~[?:?]
  at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94) ~[?:?]
  at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
  at org.apache.solr.search.ExtendedDismaxQParser.getMultiplicativeBoosts(ExtendedDismaxQParser.java:532) ~[?:?]

and query params =>

org.apache.lucene.index.ExitableDirectoryReader$ExitingReaderException: The request took too long to iterate over point values. Timeout: timeoutAt: 130698068079 (System.nanoTime(): 130791560015), PointValues=org.apache.lucene.util.bkd.BKDReader@2b77d603
  at org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.checkAndThrow(ExitableDirectoryReader.java:482)
org.apache.lucene.index.ExitableDirectoryReader$ExitingReaderException: The request took too long to iterate over point values. Timeout: timeoutAt: 130698068079 (System.nanoTime(): 130791560015), PointValues=org.apache.lucene.util.bkd.BKDReader@2b77d603
  at org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.checkAndThrow(ExitableDirectoryReader.java:482) ~[?:?]
  at org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.<init>(ExitableDirectoryReader.java:471) ~[?:?]
  at org.apache.lucene.index.ExitableDirectoryReader$ExitableFilterAtomicReader.getPointValues(ExitableDirectoryReader.java:85) ~[?:?]
  at org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator.<init>(NumericComparator.java:107) ~[?:?]
  at org.apache.lucene.search.comparators.LongComparator$LongLeafComparator.<init>(LongComparator.java:65) ~[?:?]
  at org.apache.lucene.search.comparators.LongComparator.getLeafComparator(LongComparator.java:58) ~[?:?]
  at org.apache.solr.handler.component.QueryComponent.doFieldSortValues(QueryComponent.java:523) ~[?:?]
  at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1661) ~[?:?]
  at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:421) ~[?:?]
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:420) ~[?:?]
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) ~[?:?]
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:2865) ~[?:?]
  at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:887) ~[?:?]
  at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:606) ~[?:?]
  at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250) ~[?:?]
  at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:218) ~[?:?]
  at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257) ~[?:?]
  at org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227) ~[?:?]
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213) ~[?:?]
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) ~[?:?]
  at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600) ~[jetty-security-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505) ~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]

Thanks,
Rajani

On Wed, Sep 13, 2023 at 1:59 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 9/12/23 18:28, rajani m wrote:
> > Solr 9.1.1 version, upon restarting solr on any node in the cluster, a
> > unique event is triggered across all the *other* nodes in the cluster that
> > has an impact similar to restarting solr on all the other nodes in the
> > cluster. There is dip in the cpu usage, all the caches are emptied and
> > warmed up, there are disk reads/writes on all the other nodes.
>
> How much RAM is in each node? How much is given to the Java heap? Are
> you running more than one Solr instance on each node? How much disk
> space do the indexes on each node consume?
>
> What are the counts of:
>
> * Nodes
> * Collections
> * Shards per collection
> * Replica count per shard
> * Documents per shard
>
> There is sometimes some confusion about replica count. I've seen people
> say they have "one shard and one replica" when the right way to state it
> is that the replica count is two.
>
> If the counts above are large (meaning that you have a LOT of cores)
> then restarting a node can be very disruptive to the cloud as a whole.
> See this issue from several years ago where I explored this:
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> The issue has been marked as resolved in version 6.3.0, but no code was
> modified, and as far as I know, the problem still exists.
>
> It's worth noting that in my tests for that issue, the collections were
> empty. For collections that actually have data, the problem will be worse.
>
> If there are a lot of adds/updates/deletes happening, then the delta
> between the replicas might exceed the threshold for transaction log
> recovery. Solr may be doing a full replication to the cores on the
> restarted node. But I would expect that to only affect the shard
> leaders, which are the source for the replicated data.
>
> Thanks,
> Shawn
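PS, on the question quoted above about warming a node with some queries before it starts receiving traffic: Solr's own firstSearcher/newSearcher warming in solrconfig.xml helps with the very first searcher, and a start-up hook can simply fire a handful of representative queries at the local node before it reports itself ready. A rough sketch follows; the collection name is taken from the log above, while the queries, port, and the idea of running this before the readiness-gate flip are assumptions, not a description of anyone's actual setup:

// Illustrative only: fire a few warming queries at the local Solr node
// before it reports itself ready. Queries and port are hypothetical;
// distrib=false keeps each query on the local cores instead of fanning out.
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

func main() {
	base := "http://localhost:8983/solr/v9-web/select" // collection name from the log above; default port assumed
	warming := []string{"*:*", "popular term", "another common query"} // made-up example queries

	client := &http.Client{Timeout: 10 * time.Second}
	for _, q := range warming {
		params := url.Values{}
		params.Set("q", q)
		params.Set("distrib", "false") // only hit local cores, don't fan out to the cluster
		params.Set("rows", "10")

		resp, err := client.Get(base + "?" + params.Encode())
		if err != nil {
			fmt.Println("warming query failed:", err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("warmed %q -> HTTP %d\n", q, resp.StatusCode)
	}
	// Only after this loop would the pod flip its readiness-gate condition
	// or otherwise report ready.
}

Note that this only warms the local caches; it does not stop other nodes from routing distributed sub-requests to a replica once it is active in cluster state, which matches the behaviour described above where queries start arriving as soon as the node recovers.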