Hi all,

I am currently facing a serious issue on our Solr 8.8.0 cluster.
The symptoms are:
- an increase in the number of threads
- above all, a failure of Solr to really release the index files.

This ends up filling the hard drive and crashing the server.

First analysis:
- File leak:
  Solr considers the files deleted (they appear as DEL in LSOF), but LSOF shows them still present and held by the user who started Solr.
- Thread leak (and memory leak too):
  the qtp pool seems to keep growing and its threads stay in TIMED_WAITING.

Context:
We are a large French e-commerce site.
We use Solr for the product engine (~100-120 million products).
Our indexing is permanent (even at night) and intense: 500 products per second per instance.
Each Solr instance answers between 50 (thanks to bots at night) and 400 requests per second (on a normal day).
So, and this is one of the parameters of the problem, the instances never have a moment to breathe (even at night).

Architecture:
This happens on both of our architectures: Tlog - Tlog and Tlog - Pull (we are in transition from the first to the second).
Another parameter of the problem: the bug happens on the replication client, that is, on a Pull replica or on a Tlog follower. Never on an indexer.

The bug has been occurring on all our production clusters for 2 months, since the installation of Solr 8.8.0 (replacing Solr 7.7.2).

Reproducibility: clearly on demand, on a sandbox, in about ten minutes.

How to reproduce:
- full indexing 24 hours a day
- segment files change every minute
- replication between Tlog and Pull with a 00:00:10 delay
- permanent search (100 q/s): query + faceting with edismax (I removed all specific uses for the bug research: geolock, timeOut, caches)
For the test, I have 10,000 search requests in Gatling in circular mode.

After a few minutes of load, the LSOF command shows file leaks associated with qtp threads.
I mean that the files have been deleted by IndexFetcher and LSOF shows them as DEL, but they cannot really be removed because Solr keeps a reference to them.
In the heap walker of JProfiler, I can still find Strings containing the names of these deleted files. Following the trail brings me back to SolrCore, SolrIndexSearcher and the qtp threads.
We can also see too many qtp threads parked like this one:
###
"httpShardExecutor-7-thread-58586-processing-x:offers_TP_shard4_replica_p77 r:core_node78 http:////xxxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_p87//|http:////xxxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_t17// n:xxxx.cdweb.biz:8983_solr c:offers_TP s:shard4 [http:////xxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_p87//, http:////xxxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_t17//]" - Thread t@103812
   java.lang.Thread.State: TIMED_WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <5e18a5a8> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:462)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:361)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:937)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1053)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(Thread.java:829)
   Locked ownable synchronizers:
        - None
###

And if I stop the load, the number of qtp threads decreases and the locks on the index files disappear.
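
For anyone who wants to follow the leak over time during such a test, here is a minimal sketch of a counter for deleted-but-still-open entries in LSOF output. It is not part of Solr: the class name LsofDeletedCounter and its methods are hypothetical. It assumes Linux LSOF output, where a deleted memory-mapped file has "DEL" in the FD column and a deleted file held through a file descriptor has "(deleted)" appended to its name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Solr): counts deleted-but-still-open files in lsof output. */
final class LsofDeletedCounter {

    /** True if the lsof line describes a file that is unlinked but still held by the process. */
    static boolean isDeletedEntry(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        // On Linux, lsof appends "(deleted)" to the name of an unlinked file that is still open.
        if (trimmed.endsWith("(deleted)")) {
            return true;
        }
        // A deleted memory-mapped file is reported with "DEL" in the FD column.
        // The column position varies with lsof options, so any field equal to "DEL" is accepted
        // (assumption: no path contains a standalone "DEL" token).
        for (String field : trimmed.split("\\s+")) {
            if (field.equals("DEL")) {
                return true;
            }
        }
        return false;
    }

    /** Counts the lines that look like deleted-but-open entries. */
    static long countDeleted(List<String> lsofLines) {
        return lsofLines.stream().filter(LsofDeletedCounter::isDeletedEntry).count();
    }

    /** Reads lsof output from standard input and prints the count. */
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            List<String> lines = in.lines().collect(Collectors.toList());
            System.out.println("deleted-but-open entries: " + countDeleted(lines));
        }
    }
}
```

One possible usage, with Java 11 or later, is to pipe the LSOF output for the Solr process into it at regular intervals, for example `lsof -p <solr pid> | java LsofDeletedCounter.java`, and watch whether the count keeps growing under load and drops back when the load stops.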

So what I understand:
Because of permanent index changes, replications and full-time searches, SolrIndexSearchers are created and stay in use until they become unnecessary, which never happens.
There is no index-change trigger to inform the 'search part' (qtp, SolrIndexSearcher, SolrCore ...) that it is working on an outdated index.
The searcher's RefCounter constantly has requests holding it, so nothing is ever released.

So my questions are:
- Does someone know the concepts of the SolrIndexSearcher lifecycle and could enlighten me on them?
- Did I understand correctly the general behavior of the interweaving of these 2 mechanisms?
- Does it mean that using replication requires some downtime to give the qtp thread pool time to clean up?
- What other alternatives do we have? (We tried NRT; it does not hold the load.)

NB:
If you use LSOF, you will see permanent small locks, more than I described, right from the beginning, because of the lazy release of index files by IndexFetcher in its 'fetchLatestIndex' method.
To see the main bug more clearly, I force 'solrCore.closeSearcher()' in it, so my LSOF count is clearly at 0 when the test begins.

Thank you already for reading this far.
I hope this will speak to someone.

Thanks

Emmanuel