Hi,

  Thank you for looking into this one.

I tailed the logs and found that the newly started node is the culprit: it
fails search queries for some time after startup, with the errors shown below.
The query failures on that node's shard(s) cause the overall aggregated
search query to take about 7-10 seconds, which produces latency spikes. As a
consequence, the load balancer sends fewer requests, so the other nodes see
a dip in search QPS and thereby a dip in their resource usage.

When a node goes down and then recovers fully, zk immediately starts sending
search queries to it, and the node fails x queries for some time (about 2
minutes) before it actually returns 200 responses. What is the cause of the
first error, and is there any way to avoid it?
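
On the first error, one possible reading (an assumption drawn only from the
top frames of the trace, not from the Solr source) is that
HashMap.computeIfAbsent is reached from IndexSchema.getPayloadDecoder on
several request threads at once. java.util.HashMap is not thread-safe, and its
computeIfAbsent throws ConcurrentModificationException when it detects that
the map changed during the call. The sketch below is not Solr code; the class
DecoderCacheSketch, its methods and the field name are hypothetical. It only
illustrates the general technique of backing a lazily filled per-field cache
with a ConcurrentHashMap so that parallel lookups are safe.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a lazily populated per-field cache.
class DecoderCacheSketch {
    // ConcurrentHashMap makes computeIfAbsent atomic per key.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Counts how often the (pretend) expensive loader actually runs.
    final AtomicInteger loads = new AtomicInteger();

    String decoderFor(String field) {
        return cache.computeIfAbsent(field, f -> {
            loads.incrementAndGet();
            return "decoder-for-" + f; // placeholder for a real decoder object
        });
    }

    // Starts `threads` workers that all look up the same field at once
    // and returns how many times the loader ran.
    static int loadsAfterConcurrentLookups(int threads) {
        DecoderCacheSketch sketch = new DecoderCacheSketch();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // line all workers up before the lookup
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                sketch.decoderFor("payload_field");
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sketch.loads.get();
    }
}
```

With eight threads asking for the same field, loadsAfterConcurrentLookups(8)
returns 1, because ConcurrentHashMap.computeIfAbsent applies the mapping
function at most once per key.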

I am wondering whether there is a way to configure zk not to send traffic to
that node's shards for a few minutes, so that the node can be warmed up with
some queries before it starts receiving live queries?
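
The warm-up half of that idea could be sketched as below. This is only an
illustration under stated assumptions: the base URL, the core name passed in,
and the list of warm-up queries are placeholders, and the class WarmupSketch
and its methods are hypothetical. It uses the JDK HTTP client and Solr's
distrib=false request parameter so that each query is answered by the local
core only. It does not stop the cluster from routing live traffic to the node;
it only shows how a few queries could be replayed against a single core.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;

class WarmupSketch {
    // Builds a non-distributed /select URI for one core on one node.
    static URI warmupUri(String nodeBaseUrl, String core, String q) {
        String encoded = URLEncoder.encode(q, StandardCharsets.UTF_8);
        return URI.create(nodeBaseUrl + "/" + core + "/select?distrib=false&q=" + encoded);
    }

    // Sends each query once and returns how many came back with HTTP 200.
    static int warm(HttpClient client, String nodeBaseUrl, String core, List<String> queries) {
        int ok = 0;
        for (String q : queries) {
            HttpRequest req = HttpRequest.newBuilder(warmupUri(nodeBaseUrl, core, q))
                    .timeout(Duration.ofSeconds(30))
                    .GET()
                    .build();
            try {
                HttpResponse<Void> resp = client.send(req, HttpResponse.BodyHandlers.discarding());
                if (resp.statusCode() == 200) {
                    ok++;
                }
            } catch (IOException e) {
                // A cold core may still fail or time out; move on to the next query.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ok;
    }
}
```

A caller might run warm(HttpClient.newHttpClient(), "http://localhost:8983/solr",
"v9-web_shard34_replica_n771", queries) in a loop until every query returns
200, using queries that resemble production traffic (same sorts and boosts).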


Error -

2023-09-13 17:50:34.277 ERROR (qtp517787604-178) [c:v9-web s:shard34
r:core_node772 x:v9-web_shard34_replica_n771] o.a.s.h.RequestHandlerBase
java.util.ConcurrentModificationException =>
java.util.ConcurrentModificationException
at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1221)
java.util.ConcurrentModificationException: null
at java.util.HashMap.computeIfAbsent(HashMap.java:1221) ~[?:?]
at
org.apache.solr.schema.IndexSchema.getPayloadDecoder(IndexSchema.java:2118)
~[?:?]
at
org.apache.solr.search.ValueSourceParser$66.parse(ValueSourceParser.java:899)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252)
~[?:?]
at
org.apache.solr.search.ValueSourceParser$17.parse(ValueSourceParser.java:349)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434)
~[?:?]
at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)
~[?:?]
at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252)
~[?:?]
at
org.apache.solr.search.ValueSourceParser$16.parse(ValueSourceParser.java:338)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434)
~[?:?]
at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)
~[?:?]
at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:384)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252)
~[?:?]
at
org.apache.solr.search.ValueSourceParser$17.parse(ValueSourceParser.java:349)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:264)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSourceList(FunctionQParser.java:252)
~[?:?]
at
org.apache.solr.search.ValueSourceParser$16.parse(ValueSourceParser.java:338)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:272)
~[?:?]
at
org.apache.solr.search.ValueSourceParser$DoubleParser.parse(ValueSourceParser.java:1646)
~[?:?]
at
org.apache.solr.search.FunctionQParser.parseValueSource(FunctionQParser.java:434)
~[?:?]
at org.apache.solr.search.FunctionQParser.parse(FunctionQParser.java:94)
~[?:?]
at org.apache.solr.search.QParser.getQuery(QParser.java:188) ~[?:?]
at
org.apache.solr.search.ExtendedDismaxQParser.getMultiplicativeBoosts(ExtendedDismaxQParser.java:532)
~[?:?]

and

query params  =>
org.apache.lucene.index.ExitableDirectoryReader$ExitingReaderException: The
request took too long to iterate over point values. Timeout: timeoutAt:
130698068079 (System.nanoTime(): 130791560015),
PointValues=org.apache.lucene.util.bkd.BKDReader@2b77d603
at
org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.checkAndThrow(ExitableDirectoryReader.java:482)
org.apache.lucene.index.ExitableDirectoryReader$ExitingReaderException: The
request took too long to iterate over point values. Timeout: timeoutAt:
130698068079 (System.nanoTime(): 130791560015),
PointValues=org.apache.lucene.util.bkd.BKDReader@2b77d603
at
org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.checkAndThrow(ExitableDirectoryReader.java:482)
~[?:?]
at
org.apache.lucene.index.ExitableDirectoryReader$ExitablePointValues.<init>(ExitableDirectoryReader.java:471)
~[?:?]
at
org.apache.lucene.index.ExitableDirectoryReader$ExitableFilterAtomicReader.getPointValues(ExitableDirectoryReader.java:85)
~[?:?]
at
org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator.<init>(NumericComparator.java:107)
~[?:?]
at
org.apache.lucene.search.comparators.LongComparator$LongLeafComparator.<init>(LongComparator.java:65)
~[?:?]
at
org.apache.lucene.search.comparators.LongComparator.getLeafComparator(LongComparator.java:58)
~[?:?]
at
org.apache.solr.handler.component.QueryComponent.doFieldSortValues(QueryComponent.java:523)
~[?:?]
at
org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1661)
~[?:?]
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:421)
~[?:?]
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:420)
~[?:?]
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224)
~[?:?]
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2865) ~[?:?]
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:887)
~[?:?]
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:606) ~[?:?]
at
org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:250)
~[?:?]
at
org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:218)
~[?:?]
at
org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:257)
~[?:?]
at
org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:227)
~[?:?]
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213)
~[?:?]
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
~[?:?]
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)
~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
~[jetty-security-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
~[jetty-servlet-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
~[jetty-server-9.4.48.v20220622.jar:9.4.48.v20220622]

Thanks,
Rajani



On Wed, Sep 13, 2023 at 1:59 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 9/12/23 18:28, rajani m wrote:
> >    Solr 9.1.1 version, upon restarting solr on any node in the cluster, a
> > unique event is triggered across all the *other* nodes in the cluster that
> > has an impact similar to restarting solr on all the other nodes in the
> > cluster. There is dip in the cpu usage, all the caches are emptied and
> > warmed up, there are disk reads/writes on all the other nodes.
>
> How much RAM is in each node?  How much is given to the Java heap?  Are
> you running more than one Solr instance on each node?  How much disk
> space do the indexes on each node consume?
>
> What are the counts of:
>
> * Nodes
> * Collections
> * Shards per collection
> * Replica count per shard
> * Documents per shard
>
> There is sometimes some confusion about replica count.  I've seen people
> say they have "one shard and one replica" when the right way to state it
> is that the replica count is two.
>
> If the counts above are large (meaning that you have a LOT of cores)
> then restarting a node can be very disruptive to the cloud as a whole.
> See this issue from several years ago where I explored this:
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> The issue has been marked as resolved in version 6.3.0, but no code was
> modified, and as far as I know, the problem still exists.
>
> It's worth noting that in my tests for that issue, the collections were
> empty.  For collections that actually have data, the problem will be worse.
>
> If there are a lot of adds/updates/deletes happening, then the delta
> between the replicas might exceed the threshold for transaction log
> recovery.  Solr may be doing a full replication to the cores on the
> restarted node.  But I would expect that to only affect the shard
> leaders, which are the source for the replicated data.
>
> Thanks,
> Shawn
>
>
