>> I was wondering, could it be something wrong with the solrconfig.xml parameters? Perhaps a combination of parameters does not behave stably? Do you think it makes sense to go with a vanilla solrconfig.xml and introduce all the custom options one by one (i.e. ShardHandlerFactory, etc.)?

That is a great idea. (Obviously with the operator you need to keep some of the values there that it relies on, but I think everything it uses is vanilla starting with Solr 9.)

- Houston
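As a rough illustration of that "vanilla baseline" approach, a stripped-down solrconfig.xml could start with only the essentials and have the custom pieces (shardHandlerFactory, cache sizing, circuit breakers, etc.) reintroduced one at a time. The element names below follow the stock Solr 9 examples; the specific values are assumptions for illustration, not taken from the cluster discussed in this thread:

<?xml version="1.0" encoding="UTF-8"?>
<config>
  <!-- Baseline sketch: only the essentials; custom options get reintroduced one at a time. -->
  <luceneMatchVersion>9.1</luceneMatchVersion>
  <dataDir>${solr.data.dir:}</dataDir>
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog/>  <!-- an update log is required for TLOG/PULL replica types -->
  </updateHandler>
  <requestHandler name="/select" class="solr.SearchHandler"/>
</config>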
On Thu, Jan 19, 2023 at 9:43 AM Nick Vladiceanu <vladicean...@gmail.com> wrote:

> Thanks Kevin for looking into it.
>
> I'll answer the questions in the original order:
> * The pod volume has the correct permissions. Basically, we use emptyDir provisioned by the solr-operator. All the nodes have exactly the same setup. No pods are co-located on the same worker node, and no more than one Solr core is located on the same node.
> * We are actively indexing and querying. We also use partial updates. Since we use TLOG replica types, we have a hard commit of 180s that opens a new searcher.
> * JDK 11 and JDK 17 behave the same way. We were able to reproduce on both builds.
>
> As for the directory exceptions, I also cannot understand why it is throwing that Unknown directory exception. I logged in to a Solr pod that was throwing this error and was able to confirm that the exact location exists on disk.
>
> When reload fails, it sometimes fails on one node and other times on multiple nodes at the same time. I checked all the logs on the k8s node and on the pod but couldn't find anything related to disk, network, or other errors.
>
> I was wondering, could it be something wrong with the solrconfig.xml parameters? Perhaps a combination of parameters does not behave stably? Do you think it makes sense to go with a vanilla solrconfig.xml and introduce all the custom options one by one (i.e. ShardHandlerFactory, etc.)?
>
> ---
> Nick Vladiceanu
> vladicean...@gmail.com
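For context on the hard-commit setup mentioned above, a 180-second auto hard commit that opens a new searcher is typically expressed in solrconfig.xml along these lines; the exact values and surrounding elements here are a sketch, not a copy of the cluster's actual configuration:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog/>
  <autoCommit>
    <maxTime>180000</maxTime>          <!-- 180s, expressed in milliseconds -->
    <openSearcher>true</openSearcher>  <!-- the hard commit also opens a new searcher -->
  </autoCommit>
</updateHandler>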
> On 18. Jan 2023, at 18:41, Kevin Risden <kris...@apache.org> wrote:

>> So I am going to share some ideas just in case it triggers something - I have this gut feeling that the cores are closing due to an exception of some kind. It seems like a lot of the issue is either index corruption or "SolrCoreState already closed."
>>
>> * Does the pod volume have the correct permissions for Solr to read/write?
>> * Are you indexing these nodes or just querying? (Asking because if these were meant to be read-only, that would be different from a changing index.)
>> * Have you taken into account https://issues.apache.org/jira/browse/SOLR-16463 by chance, if you have a custom Docker image? (This might not be necessary, since you say it reproduces on JDK 11.)
>>
>> I found this part of your update the most intriguing. Why would changing the directory factory change this? My understanding is that everything under "/var/solr/data/my_collection_shard3_replica_t1643" should be controlled by Solr, both read and write, so any directories underneath would be created automatically.
>>
>>> directoryFactory:
>>> https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/MMapDirectoryFactory.html throwing the following exception: o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory: MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata (we do not use snapshots at all) (stack trace https://justpaste.it/88en6)
>>> Switched to https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/StandardDirectoryFactory.html; problem solved, no more Unknown directory exceptions
>>> Reload won't fail on some nodes with the Unknown directory exception;
>>> Result: reload still timing out, fewer exceptions;
>>
>> My guess is that the reload is going to some node and that one node is causing the whole process to time out. If you find that node then you should be able to collect the logs and see. Basically there is some reason the cores are closing, and it's not good. I would guess the collection reload timing out is just a symptom of whatever the bigger underlying cause is.
>>
>> Kevin Risden
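For reference, the directory-factory switch described in the quoted digest is a one-line change in solrconfig.xml. The snippet below is a sketch using the stock Solr class names and makes no claim about any other attributes the actual config may set:

<!-- Before (produced the "Unknown directory ... snapshot_metadata" reload errors described above):
     <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/> -->
<directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/>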
>> On Wed, Dec 21, 2022 at 5:57 AM Nick Vladiceanu <vladicean...@gmail.com> wrote:

>>> Yes, it's very unusual. On Solr 8.11 (and previous versions), with this setup and size of data, reload takes just a few seconds.
>>>
>>> We have 6 shards spread across 96 replicas. Each replica is hosted on a dedicated EC2 instance, with no more than one replica on the same machine (in k8s terms: one pod per node, one Solr replica per pod).
>>>
>>> I am able to reproduce the reload issue on Solr 9.0 and 9.1. I tried to isolate the underlying node along with the Solr pod, but couldn't identify any issues like high load, iowait, and so on. The only issues I see are exceptions in the Solr logs that never recover unless pods are restarted (we use emptyDir and not a persistent volume, so every time a pod is restarted the cores that were hosted on it are removed; when the pod comes back, it gets allocated to the same shard or another shard, depending on how many replicas the other shards have, as we try to keep the number of replicas balanced).
>>>
>>> I agree that 23GB of heap is a bit too much, and we are doing some work to optimize it (resizing caches, etc.). We already tried lowering the heap to 20GB; GC performance is better, and in general Solr performs better. I must mention that we have the same heap size on Solr 8.11 and it doesn't cause any issues with the reload. Could it somehow have an impact on Solr 9?
>>>
>>> Thank you a lot for sharing your thoughts, especially for explaining the GC params and sharing yours, very much appreciated.
>>>
>>> Do you have any ideas on what else we should try? Here is the digest of what we have tried, without any success:
>>>
>>> Zookeeper: upgrade from 3.6 to 3.7 and 3.8; no impact;
>>>
>>> DNS: Solr pods joining and communicating over Pod IP instead of Pod Svc DNS name (headless). This was done in order to avoid any potential issues with DNS resolvers (even though CoreDNS/nodelocaldns metrics looked OK); no impact;
>>>
>>> Lucene: upgrade to Lucene 9.1.0, 9.2.0, 9.3.0; no impact;
>>>
>>> Solr nodes:
>>> * version: tried Solr 9.0.0 and Solr 9.1.0; Result: no difference;
>>> * Heap: recalculated the heap size; reduced the size by 3GB (15%) in combination with the cache resize (see below); Result: better performance, no old GC is triggered, cluster more stable; reload still timing out;
>>> * TLOG and PULL: tested with 3 TLOG replicas per shard and the remaining 12 replicas as PULL; tested all 15 replicas per shard of type TLOG; NRT is not an option at all, didn't even try to test it; Result: better response time with PULL, no impact on reload;
>>> * Other tunings, including GC; no impact;
>>>
>>> solrconfig.xml:
>>> * directoryFactory: https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/MMapDirectoryFactory.html throwing the following exception: o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory: MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata (we do not use snapshots at all) (stack trace https://justpaste.it/88en6). Switched to https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/StandardDirectoryFactory.html; problem solved, no more Unknown directory exceptions; reload won't fail on some nodes with the Unknown directory exception; Result: reload still timing out, fewer exceptions;
>>> * lockType: switched between "native" and "simple" lock type; Result: no impact;
>>> * HttpShardHandlerFactory: increased the timeout by 40% for cross-shard communication while doing queries; Result: no impact;
>>> * filterCache, queryResultCache and documentCache: limited the size of the caches in megabytes instead of entries: filterCache - 1024MB, queryResultCache - 1024MB, documentCache - 2048MB; Result: nodes are more stable during reload, the cluster is not destabilizing, no old GC activity; better response time, less pressure on GC;
>>> * circuitBreaker: disabled the circuitBreaker; Result: no impact;
>>>
>>> ---
>>> Nick Vladiceanu
>>> vladicean...@gmail.com
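The RAM-based cache limits in the digest above map to the maxRamMB attribute on the cache definitions in the <query> section of solrconfig.xml. The sketch below assumes the Solr 9 default CaffeineCache implementation and omits any other attributes (autowarm counts, initial sizes) the real config may carry:

<query>
  <!-- Sizes follow the figures quoted in the digest; everything else is left at defaults. -->
  <filterCache      class="solr.CaffeineCache" maxRamMB="1024"/>
  <queryResultCache class="solr.CaffeineCache" maxRamMB="1024"/>
  <documentCache    class="solr.CaffeineCache" maxRamMB="2048"/>
</query>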
I'm sure you can reduce that by a lot and encounter > >> smaller GC pauses as a result. If you can share your GC logs, I should > be > >> able to provide a recommendation. > >>> > >>> I've been looking at what MinHeapFreeRatio and MaxHeapFreeRatio do. > >> Those settings are probably unnecessary. This is what I currently use > for > >> GC tuning on JDK 11 or JDK 17. This produces EXTREMELY short collection > >> pauses, but I have noticed that throughput-heavy things like indexing > run a > >> bit slower, but if the indexing is multi-threaded, I think that it would > >> not be affected a lot. > >>> > >>> GC_TUNE=" \ > >>> -XX:+UnlockExperimentalVMOptions \ > >>> -XX:+UseZGC \ > >>> -XX:+ParallelRefProcEnabled \ > >>> -XX:+ExplicitGCInvokesConcurrent \ > >>> -XX:+UseStringDeduplication \ > >>> -XX:+AlwaysPreTouch \ > >>> -XX:+UseNUMA \ > >>> " > >>> > >>> ZGC has one unexpected disadvantage. Using it will disable Compressed > >> OOPs -- meaning that even with a heap smaller than 32GB, it uses 64 bit > >> pointers. This hasn't really impacted me ... the index is so small that > >> with a 1GB heap I have more than enough. If low pauses are the most > >> important thing you need from GC and you're running at least JDK11, I > would > >> strongly recommend ZGC. It does make indexing slower for me -- a full > >> rebuild that takes 10 minutes with G1 takes 11 minutes with ZGC. But > even > >> the worst-case GC pauses are single-digit milliseconds. > >>> > >>> For G1GC, which is still the best option for JDK8, this is what I used > >> to have: > >>> > >>> #GC_TUNE=" \ > >>> # -XX:+UseG1GC \ > >>> # -XX:+ParallelRefProcEnabled \ > >>> # -XX:MaxGCPauseMillis=100 \ > >>> # -XX:+ExplicitGCInvokesConcurrent \ > >>> # -XX:+UseStringDeduplication \ > >>> # -XX:+AlwaysPreTouch \ > >>> # -XX:+UseNUMA \ > >>> #" > >>> > >>> Thanks, > >>> Shawn > >> > >> > >