Just read through this, and don't yet have any concrete ideas better than what's been given, but I'm interested to clarify one thing you said:
> We are having 6 shards spread across 96 replicas. Each replica is hosted
> on a dedicated EC2 instance, no more than one replica present on the
> same machine

Is that implying 6x96 physical machines (= 576 pieces of hardware?), or are
you overlapping replicas for different shards on the same machine (= 576
processes on 96 bits of hardware), or overlapping on the same node (96
processes on 96 bits of hardware)? The last one is much more common. If
you've really got 576 Java processes running Solr, that's a fair bit of
communication that needs to happen as replicas go up and down. Have you
observed any slowness on ZooKeeper during these episodes?

On Thu, Jan 19, 2023 at 9:58 AM Houston Putman <hous...@apache.org> wrote:

> > I was wondering, could it be something wrong with the solrconfig.xml
> > parameters? Perhaps a combination of parameters does not behave stably?
> > Do you think it makes sense to go with a vanilla solrconfig.xml and
> > introduce all the custom options one-by-one (i.e. ShardHandlerFactory,
> > etc.)?
>
> That is a great idea. (Obviously with the operator you need to keep some
> of the values there that it relies on, but I think everything it uses is
> vanilla starting with Solr 9.)
>
> - Houston
>
> On Thu, Jan 19, 2023 at 9:43 AM Nick Vladiceanu <vladicean...@gmail.com>
> wrote:
>
> > Thanks Kevin for looking into it.
> >
> > I’ll answer the questions in the original order:
> > * Pod volume has the correct permissions. Basically, we use emptyDir
> > provisioned by the solr-operator. All the nodes have exactly the same
> > setup. No pods are co-located on the same worker node. No more than one
> > Solr core is located on the same node.
> > * We are actively indexing and querying. We also use partial updates.
> > Since we use TLOG replica types, we have a hard commit of 180s that
> > opens a new searcher.
> > * JDK 11 and JDK 17 behave the same way. We were able to reproduce on
> > both builds.
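The hard-commit behavior described above (a hard commit every 180s that
opens a new searcher) would correspond to a solrconfig.xml fragment roughly
like this sketch; everything beyond maxTime and openSearcher is an
assumption based on common defaults:

```xml
<!-- Sketch only: hard commit every 180s (180000 ms) that also opens a
     new searcher, matching the setup described in the thread. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>180000</maxTime>
    <openSearcher>true</openSearcher>
  </autoCommit>
</updateHandler>
```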
> >
> > As for the directory exceptions, I also cannot understand why it is
> > throwing that Unknown directory exception. I logged in to a Solr pod
> > that was throwing this error and was able to verify that the exact
> > location on the disk exists.
> >
> > When reload fails, sometimes it might fail on one node, other times on
> > multiple nodes at the same time. I was checking all the logs on the k8s
> > node and on the pod but couldn’t find anything related to the disk,
> > network, or other errors.
> >
> > I was wondering, could it be something wrong with the solrconfig.xml
> > parameters? Perhaps a combination of parameters does not behave stably?
> > Do you think it makes sense to go with a vanilla solrconfig.xml and
> > introduce all the custom options one-by-one (i.e. ShardHandlerFactory,
> > etc.)?
> >
> > ---
> > Nick Vladiceanu
> > vladicean...@gmail.com
> >
> > > On 18. Jan 2023, at 18:41, Kevin Risden <kris...@apache.org> wrote:
> > >
> > > So I am going to share some ideas just in case it triggers something -
> > > I have this gut feeling that the cores are closing due to an exception
> > > of some kind. It seems like a lot of the issue is either index
> > > corruption or "SolrCoreState already closed."
> > >
> > > * Does the pod volume have the correct permissions for Solr to
> > > read/write?
> > > * Are you indexing these nodes or just querying? (Asking because if
> > > these are meant to be read-only, that would be different than a
> > > changing index.)
> > > * Have you taken into account
> > > https://issues.apache.org/jira/browse/SOLR-16463 by chance, if you
> > > have a custom Docker image? (This might not be necessary since you
> > > say it reproduces on JDK 11.)
> > >
> > > I found this part of your update the most intriguing. Why would
> > > changing the directory factory change this?
> > > My understanding is that everything under
> > > "/var/solr/data/my_collection_shard3_replica_t1643" should be
> > > controlled by Solr, both read and write, so any directories
> > > underneath would be created automatically.
> > >
> > > directoryFactory:
> > >> https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/MMapDirectoryFactory.html
> > >> throwing the following exception: o.a.s.c.SolrCore
> > >> java.lang.IllegalArgumentException: Unknown directory:
> > >> MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata
> > >> (we do not use snapshots at all) (stack trace
> > >> https://justpaste.it/88en6)
> > >> Switched to
> > >> https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/StandardDirectoryFactory.html;
> > >> problem solved, no more Unknown directory exceptions
> > >> Reload won’t fail on some nodes with Unknown directory exception;
> > >> Result: reload still timing out, fewer exceptions;
> > >
> > > My guess is that the reload is going to some node and that one node
> > > is causing the whole process to time out. If you find that node then
> > > you should be able to collect the logs and see. Basically there is
> > > some reason the cores are closing, and it's not good. I would guess
> > > the collection reload timing out is just a symptom of whatever the
> > > bigger underlying cause is.
> > >
> > > Kevin Risden
> > >
> > > On Wed, Dec 21, 2022 at 5:57 AM Nick Vladiceanu
> > > <vladicean...@gmail.com> wrote:
> > >
> >> yes, it’s very unusual. On Solr 8.11 (and previous versions) with this
> >> setup and size of data, reload takes just a few seconds.
> >>
> >> We are having 6 shards spread across 96 replicas. Each replica is
> >> hosted on a dedicated EC2 instance, no more than one replica present
> >> on the same machine (in k8s words, it's one pod per node, one Solr
> >> replica per pod).
> >>
> >> I am able to reproduce the reload issue on Solr 9.0 and 9.1.
> >> Tried to isolate the underlying node along with the Solr pod;
> >> couldn’t identify any issues like high load, iowait, whatsoever. The
> >> only issues I see are exceptions in the Solr logs that never recover
> >> unless pods are restarted. (We use emptyDir and not a persistent
> >> volume; every time a pod is restarted, the cores that were hosted on
> >> it are removed, and when the pod comes back it gets allocated to the
> >> same shard or another shard, depending on how many replicas other
> >> shards have; we try to keep the number of replicas balanced.)
> >>
> >> I agree that 23GB of heap is a bit too much, and we are doing some
> >> work to optimize it (resizing caches, etc.). We already tried to lower
> >> the heap to 20GB; GC performance is better, and in general Solr
> >> performs better. I must mention that we have the same heap size in
> >> Solr 8.11 and it doesn’t cause any issues with the reload. Could it
> >> have an impact on Solr 9, somehow?
> >>
> >> Thank you a lot for sharing your thoughts, especially for explaining
> >> the GC params and sharing yours, very much appreciated.
> >>
> >> Do you have any ideas on what we should try next? Here is a digest of
> >> what we have tried, without any success:
> >>
> >> Zookeeper: upgrade from 3.6 to 3.7 and 3.8; no impact;
> >>
> >> DNS: Solr pods joining and communicating over Pod IP instead of Pod
> >> Svc DNS name (headless).
> >> This was done in order to avoid any potential issues with DNS
> >> resolvers (even though CoreDNS/nodelocaldns metrics looked OK); no
> >> impact;
> >>
> >> Lucene: upgrade to Lucene 9.1.0, 9.2.0, 9.3.0; no impact;
> >>
> >> Solr nodes:
> >> version:
> >> tried Solr 9.0.0 and Solr 9.1.0;
> >> Result: no difference;
> >> Heap:
> >> recalculated the heap size;
> >> reduced the size by 3GB (15%) in combination with a cache resize (see
> >> below);
> >> Result: better performance, no old GC is triggered; cluster more
> >> stable; reload still timing out;
> >> TLOG and PULL:
> >> tested with 3 TLOG replicas per shard, the remaining 12 replicas PULL;
> >> tested with all 15 replicas per shard of type TLOG;
> >> NRT is not an option at all, didn’t even try to test;
> >> Result: better response time with PULL, no impact on reload;
> >> Other tunings, including GC: no impact;
> >>
> >> solrconfig.xml:
> >> directoryFactory:
> >> https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/MMapDirectoryFactory.html
> >> throwing the following exception: o.a.s.c.SolrCore
> >> java.lang.IllegalArgumentException: Unknown directory:
> >> MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata
> >> (we do not use snapshots at all) (stack trace
> >> https://justpaste.it/88en6)
> >> Switched to
> >> https://solr.apache.org/docs/9_1_0/core/org/apache/solr/core/StandardDirectoryFactory.html;
> >> problem solved, no more Unknown directory exceptions;
> >> Reload won’t fail on some nodes with Unknown directory exception;
> >> Result: reload still timing out, fewer exceptions;
> >> lockType:
> >> switched between “native” and “simple” lock type;
> >> Result: no impact;
> >> HttpShardHandlerFactory:
> >> increased timeout by 40% for cross-shard communication while doing
> >> queries;
> >> Result: no impact;
> >> filterCache, queryResultCache and documentCache:
> >> limited the size of the caches in megabytes instead of entries:
> >> filterCache - 1024MB
> >> queryResultCache - 1024MB
> >> documentCache - 2048MB
> >> Result: nodes are more stable during reload, cluster is not
> >> destabilizing, no old GC activity; better response time, less
> >> pressure on GC;
> >> circuitBreaker:
> >> disabled the circuitBreaker;
> >> Result: no impact;
> >>
> >> ---
> >> Nick Vladiceanu
> >> vladicean...@gmail.com
> >>
> >>> On 20. Dec 2022, at 15:58, Shawn Heisey <apa...@elyograg.org> wrote:
> >>>
> >>> On 12/20/22 06:34, Nick Vladiceanu wrote:
> >>>> Thank you Shawn for sharing, indeed useful information.
> >>>> However, I must say that we only used deleteById and never
> >>>> deleteByQuery. We also rely only on automatic segment merging and
> >>>> never issue the optimize command.
> >>>
> >>> That is very unusual. I've never seen a core reload take more than a
> >>> few seconds, even when I was dealing with core sizes of double-digit
> >>> GB. Unless you have hundreds or thousands of replicas for each of
> >>> your 6 shards, it really should complete very quickly.
> >>>
> >>> Have you been able to determine which Solr cores in the collection
> >>> are causing the delay, and take a look at those machines?
> >>>
> >>> Some thoughts:
> >>>
> >>> When you said 96 nodes, were you talking about Solr instances or
> >>> servers? You really should only run one Solr instance per server,
> >>> especially for a small index like this.
> >>>
> >>> A 23GB heap seems very excessive for a 4.7GB index that has less
> >>> than 4 million documents. I'm sure you can reduce that by a lot and
> >>> encounter smaller GC pauses as a result. If you can share your GC
> >>> logs, I should be able to provide a recommendation.
> >>>
> >>> I've been looking at what MinHeapFreeRatio and MaxHeapFreeRatio do.
> >>> Those settings are probably unnecessary.
> >>> This is what I currently use for GC tuning on JDK 11 or JDK 17. It
> >>> produces EXTREMELY short collection pauses. I have noticed that
> >>> throughput-heavy things like indexing run a bit slower, but if the
> >>> indexing is multi-threaded, I think it would not be affected a lot.
> >>>
> >>> GC_TUNE=" \
> >>> -XX:+UnlockExperimentalVMOptions \
> >>> -XX:+UseZGC \
> >>> -XX:+ParallelRefProcEnabled \
> >>> -XX:+ExplicitGCInvokesConcurrent \
> >>> -XX:+UseStringDeduplication \
> >>> -XX:+AlwaysPreTouch \
> >>> -XX:+UseNUMA \
> >>> "
> >>>
> >>> ZGC has one unexpected disadvantage. Using it will disable Compressed
> >>> OOPs -- meaning that even with a heap smaller than 32GB, it uses
> >>> 64-bit pointers. This hasn't really impacted me ... the index is so
> >>> small that with a 1GB heap I have more than enough. If low pauses are
> >>> the most important thing you need from GC and you're running at least
> >>> JDK 11, I would strongly recommend ZGC. It does make indexing slower
> >>> for me -- a full rebuild that takes 10 minutes with G1 takes 11
> >>> minutes with ZGC. But even the worst-case GC pauses are single-digit
> >>> milliseconds.
> >>>
> >>> For G1GC, which is still the best option for JDK 8, this is what I
> >>> used to have:
> >>>
> >>> #GC_TUNE=" \
> >>> # -XX:+UseG1GC \
> >>> # -XX:+ParallelRefProcEnabled \
> >>> # -XX:MaxGCPauseMillis=100 \
> >>> # -XX:+ExplicitGCInvokesConcurrent \
> >>> # -XX:+UseStringDeduplication \
> >>> # -XX:+AlwaysPreTouch \
> >>> # -XX:+UseNUMA \
> >>> #"
> >>>
> >>> Thanks,
> >>> Shawn
> >>
> >

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
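For anyone trying to reproduce the experiments from the digest earlier in
the thread, the directoryFactory switch and the byte-bounded caches would
map onto solrconfig.xml roughly like this; a sketch only, where the cache
sizes are the ones quoted in the thread and the class names and other
attributes are assumptions based on Solr 9 defaults:

```xml
<!-- Sketch only: StandardDirectoryFactory instead of MMapDirectoryFactory,
     as tried in the thread -->
<directoryFactory name="DirectoryFactory"
                  class="solr.StandardDirectoryFactory"/>

<query>
  <!-- Caches bounded by RAM (maxRamMB) instead of entry count;
       sizes taken from the thread -->
  <filterCache class="solr.CaffeineCache" maxRamMB="1024"/>
  <queryResultCache class="solr.CaffeineCache" maxRamMB="1024"/>
  <documentCache class="solr.CaffeineCache" maxRamMB="2048"/>
</query>
```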