Yep - I edited the two instances to remove those differences and re-ran a fairly torturous testing cycle at 100 requests/sec.
Interestingly, 8.9.0 outperformed 8.3.1 significantly at this point and neither hit the threads limit or failed to create new threads. But both scaled up to well above 2k threads and even after stopping the test, the threads remained in a timed wait condition. So I'm still thinking there's a problem with threads not getting terminated somewhere.

Test output below if it's of use:

8.3.1

upload: ./831_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:20:25
Requests [total, rate, throughput] 6000, 100.02, 25.05
Duration [total, attack, wait] 1m29.97089157s, 59.990075461s, 29.980816109s
Latencies [mean, 50, 95, 99, max] 11.011900024s, 5.175300857s, 30.00091013s, 30.001077592s, 30.014343222s
Bytes In [total, mean] 42385846, 7064.31
Bytes Out [total, mean] 0, 0.00
Success [ratio] 37.57%
Status Codes [code:count] 0:3741 200:2254 500:5
Error Set:
500 Server Error

Wed Oct 13 12:20:27 UTC 2021
upload: ./831_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:21:57
Requests [total, rate, throughput] 6000, 100.02, 10.98
Duration [total, attack, wait] 1m29.931457331s, 59.990859187s, 29.940598144s
Latencies [mean, 50, 95, 99, max] 7.74211412s, 57.37µs, 30.000885208s, 30.00105703s, 30.011268185s
Bytes In [total, mean] 21287225, 3547.87
Bytes Out [total, mean] 0, 0.00
Success [ratio] 16.45%
Status Codes [code:count] 0:5007 200:987 500:6
Error Set:
500 Server Error

Wed Oct 13 12:21:58 UTC 2021
upload: ./831_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:23:27
Requests [total, rate, throughput] 6000, 100.02, 11.68
Duration [total, attack, wait] 1m29.58141878s, 59.990945585s, 29.590473195s
Latencies [mean, 50, 95, 99, max] 9.05339281s, 58.742µs, 30.000880433s, 30.001052307s, 30.004624884s
Bytes In [total, mean] 17434533, 2905.76
Bytes Out [total, mean] 0, 0.00
Success [ratio] 17.43%
Status Codes [code:count] 0:4874 200:1046 500:80
Error Set:
500 Server Error
context deadline exceeded (Client.Timeout or context cancellation while reading body)

Wed Oct 13 12:23:29 UTC 2021
upload: ./831_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:24:58
Requests [total, rate, throughput] 6000, 100.02, 1.70
Duration [total, attack, wait] 1m29.541198374s, 59.990833778s, 29.550364596s
Latencies [mean, 50, 95, 99, max] 8.155874514s, 51.073µs, 30.000892751s, 30.001051441s, 30.009856373s
Bytes In [total, mean] 2211649, 368.61
Bytes Out [total, mean] 0, 0.00
Success [ratio] 2.53%
Status Codes [code:count] 0:5810 200:152 500:38
Error Set:
500 Server Error

Wed Oct 13 12:24:59 UTC 2021
upload: ./831_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:26:29
Requests [total, rate, throughput] 6000, 100.02, 0.73
Duration [total, attack, wait] 1m29.451307081s, 59.990601937s, 29.460705144s
Latencies [mean, 50, 95, 99, max] 7.257427504s, 50.866µs, 30.000885386s, 30.001023192s, 30.009992128s
Bytes In [total, mean] 837808, 139.63
Bytes Out [total, mean] 0, 0.00
Success [ratio] 1.08%
Status Codes [code:count] 0:5907 200:65 500:28
Error Set:
500 Server Error
context deadline exceeded (Client.Timeout or context cancellation while reading body)

Wed Oct 13 12:26:30 UTC 2021
upload: ./831_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:28:00
Requests [total, rate, throughput] 6000, 100.02, 1.65
Duration [total, attack, wait] 1m29.760671622s, 59.990482467s, 29.770189155s
Latencies [mean, 50, 95, 99, max] 8.288506559s, 50.424µs, 30.000894904s, 30.00104114s, 30.016506845s
Bytes In [total, mean] 1973103, 328.85
Bytes Out [total, mean] 0, 0.00
Success [ratio] 2.47%
Status Codes [code:count] 0:5821 200:148 500:31
Error Set:
500 Server Error

8.9.0

upload: ./890_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:20:27
Requests [total, rate, throughput] 6000, 100.02, 28.50
Duration [total, attack, wait] 1m29.930332927s, 59.990809535s, 29.939523392s
Latencies [mean, 50, 95, 99, max] 10.958871049s, 5.184437078s, 30.000885573s, 30.001088926s, 30.00946393s
Bytes In [total, mean] 51095634, 8515.94
Bytes Out [total, mean] 0, 0.00
Success [ratio] 42.72%
Status Codes [code:count] 0:3432 200:2563 500:5
Error Set:
500 Server Error

Wed Oct 13 12:20:28 UTC 2021
upload: ./890_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:21:58
Requests [total, rate, throughput] 6000, 100.02, 38.15
Duration [total, attack, wait] 1m29.951119527s, 59.990310608s, 29.960808919s
Latencies [mean, 50, 95, 99, max] 10.089597212s, 4.856237339s, 30.000863563s, 30.001063417s, 30.013413816s
Bytes In [total, mean] 64455304, 10742.55
Bytes Out [total, mean] 0, 0.00
Success [ratio] 57.20%
Status Codes [code:count] 0:2565 200:3432 500:3
Error Set:
500 Server Error

Wed Oct 13 12:22:00 UTC 2021
upload: ./890_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:23:30
Requests [total, rate, throughput] 6000, 100.02, 31.12
Duration [total, attack, wait] 1m29.971513653s, 59.990882146s, 29.980631507s
Latencies [mean, 50, 95, 99, max] 7.982882635s, 4.086936886s, 30.000611417s, 30.001022921s, 30.00949772s
Bytes In [total, mean] 49838902, 8306.48
Bytes Out [total, mean] 0, 0.00
Success [ratio] 46.67%
Status Codes [code:count] 0:3011 200:2800 500:4 503:185
Error Set:
500 Server Error
unexpected EOF
503 Service Unavailable

Wed Oct 13 12:23:31 UTC 2021
upload: ./890_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:25:01
Requests [total, rate, throughput] 6000, 100.02, 30.21
Duration [total, attack, wait] 1m29.871026977s, 59.990066513s, 29.880960464s
Latencies [mean, 50, 95, 99, max] 9.187948603s, 5.166973696s, 30.000723538s, 30.001043637s, 30.009522646s
Bytes In [total, mean] 52508950, 8751.49
Bytes Out [total, mean] 0, 0.00
Success [ratio] 45.25%
Status Codes [code:count] 0:3118 200:2715 500:4 503:163
Error Set:
500 Server Error
unexpected EOF
503 Service Unavailable

Wed Oct 13 12:25:02 UTC 2021
upload: ./890_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:26:33
Requests [total, rate, throughput] 6000, 100.02, 28.33
Duration [total, attack, wait] 1m29.931899376s, 59.990583341s, 29.941316035s
Latencies [mean, 50, 95, 99, max] 10.643567102s, 4.009658048s, 30.000888301s, 30.00107338s, 30.010319307s
Bytes In [total, mean] 51869384, 8644.90
Bytes Out [total, mean] 0, 0.00
Success [ratio] 42.47%
Status Codes [code:count] 0:3448 200:2548 500:4
Error Set:
500 Server Error
context deadline exceeded (Client.Timeout or context cancellation while reading body)

Wed Oct 13 12:26:34 UTC 2021
upload: ./890_tests_results.bin to s3://adzuna-files-stage/test_framework/solr/tests/2021-10-13:12:28:03
Requests [total, rate, throughput] 6000, 100.02, 40.46
Duration [total, attack, wait] 1m29.540373624s, 59.990568262s, 29.549805362s
Latencies [mean, 50, 95, 99, max] 9.938485443s, 5.811191827s, 30.000761953s, 30.001041833s, 30.014534484s
Bytes In [total, mean] 64483027, 10747.17
Bytes Out [total, mean] 0, 0.00
Success [ratio] 60.38%
Status Codes [code:count] 0:2372 200:3623 500:5
Error Set:
500 Server Error

On Wed, 13 Oct 2021 at 13:06, Deepak Goel <deic...@gmail.com> wrote:

> Hello
>
> I can as of now see two changes:
>
> 1. -Xmx
>
> 2. +ExplicitGCInvokesConcurrent
>
> Deepak
> "The greatness of a nation can be judged by the way its animals are treated
> - Mahatma Gandhi"
>
> +91 73500 12833
> deic...@gmail.com
>
> Facebook: https://www.facebook.com/deicool
> LinkedIn: www.linkedin.com/in/deicool
>
> "Plant a Tree, Go Green"
>
> Make In India : http://www.makeinindia.com/home
>
>
> On Wed, Oct 13, 2021 at 5:09 PM Dominic Humphries
> <domi...@adzuna.com.invalid> wrote:
>
> > CLI invocation for 8.3.1 is
> > java -server -Xmx15826m -XX:+UseG1GC -XX:+PerfDisableSharedMem
> > -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages
> > -XX:+AlwaysPreTouch
> > -Xlog:gc*:file=/srv/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
> > -Dcom.sun.management.jmxremote
> > -Dcom.sun.management.jmxremote.local.only=false
> > -Dcom.sun.management.jmxremote.ssl=false
> > -Dcom.sun.management.jmxremote.authenticate=false
> > -Dcom.sun.management.jmxremote.port=18983
> > -Dcom.sun.management.jmxremote.rmi.port=18983 -Dsolr.log.dir=/srv/solr/logs
> > -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC
> > -Djetty.home=/usr/local/solr/server -Dsolr.solr.home=/srv/solr/data
> > -Dsolr.data.home= -Dsolr.install.dir=/usr/local/solr
> > -Dsolr.default.confdir=/usr/local/solr/server/solr/configsets/_default/conf
> > -Dlog4j.configurationFile=file:/srv/solr/log4j2.xml
> > -Dsolr.disable.shardsWhitelist=true -Xss256k -Dsolr.jetty.https.port=8983
> > -jar start.jar --module=http
> > I believe the key items are:
> > -XX:+AlwaysPreTouch
> > -XX:+ParallelRefProcEnabled
> > -XX:+PerfDisableSharedMem
> > -XX:+UseG1GC
> > -XX:+UseLargePages
> > -XX:MaxGCPauseMillis=250
> > -Xlog:gc*:file=/srv/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
> > -Xmx15826m
> > -Xss256k
> >
> > And for 8.9.0 is
> > java -server -Xmx7913m -XX:+UseG1GC -XX:+PerfDisableSharedMem
> > -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250 -XX:+UseLargePages
> > -XX:+AlwaysPreTouch -XX:+ExplicitGCInvokesConcurrent
> > -Xlog:gc*:file=/srv/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
> > -Dsolr.jetty.inetaccess.includes= -Dsolr.jetty.inetaccess.excludes=
> > -Dcom.sun.management.jmxremote
> > -Dcom.sun.management.jmxremote.local.only=false
> > -Dcom.sun.management.jmxremote.ssl=false
> > -Dcom.sun.management.jmxremote.authenticate=false
> > -Dcom.sun.management.jmxremote.port=18983
> > -Dcom.sun.management.jmxremote.rmi.port=18983 -Dsolr.log.dir=/srv/solr/logs
> > -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC
> > -XX:-OmitStackTraceInFastThrow
> > -XX:OnOutOfMemoryError=/usr/local/solr/bin/oom_solr.sh 8983 /srv/solr/logs
> > -Djetty.home=/usr/local/solr/server -Dsolr.solr.home=/srv/solr/data
> > -Dsolr.data.home= -Dsolr.install.dir=/usr/local/solr
> > -Dsolr.default.confdir=/usr/local/solr/server/solr/configsets/_default/conf
> > -Dlog4j.configurationFile=/srv/solr/log4j2.xml
> > -Dsolr.disable.shardsWhitelist=true -Xss256k -jar start.jar --module=http
> > Key:
> > -XX:+AlwaysPreTouch
> > -XX:+ExplicitGCInvokesConcurrent
> > -XX:+ParallelRefProcEnabled
> > -XX:+PerfDisableSharedMem
> > -XX:+UseG1GC
> > -XX:+UseLargePages
> > -XX:-OmitStackTraceInFastThrow
> > -XX:MaxGCPauseMillis=250
> > -XX:OnOutOfMemoryError=/usr/local/solr/bin/oom_solr.sh 8983 /srv/solr/logs
> > -Xlog:gc*:file=/srv/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
> > -Xmx7913m
> > -Xss256k
> >
> > Xmx values are based on the instance RAM, currently they're running on two
> > different instance types but we see the same behaviour when they're on
> > identical types too.
> >
> > Many thanks
> >
> > Dominic
> >
> > On Wed, 13 Oct 2021 at 12:07, Deepak Goel <deic...@gmail.com> wrote:
> >
> > > Hello
> > >
> > > Can you please tell us the JVM Heap Setting for both the versions: 8.3.1,
> > > 8.9.0?
> > >
> > > I will also have to look into the following code: FileFloatSource.java:210.
> > > (will do it tonite-IST and update)
> > >
> > > Deepak
> > > "The greatness of a nation can be judged by the way its animals are treated
> > > - Mahatma Gandhi"
> > >
> > > +91 73500 12833
> > > deic...@gmail.com
> > >
> > > Facebook: https://www.facebook.com/deicool
> > > LinkedIn: www.linkedin.com/in/deicool
> > >
> > > "Plant a Tree, Go Green"
> > >
> > > Make In India : http://www.makeinindia.com/home
> > >
> > >
> > > On Wed, Oct 13, 2021 at 4:06 PM Dominic Humphries
> > > <domi...@adzuna.com.invalid> wrote:
> > >
> > > > Oh, that's very helpful to know about, ty
> > > >
> > > > The overwhelming majority appear to be threads in TIMED_WAITING, all
> > > > waiting on the same thing:
> > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3b315cbb
> > > >
> > > > I've attached a screenshot which includes the stack trace. Stopping all
> > > > queries to the instance and waiting didn't result in any noticeable
> > > > decrease in the number of threads so it looks like despite being timed,
> > > > they're simply not getting terminated.
> > > >
> > > > Restarting the service takes me back down to just 53 threads; re-running a
> > > > test results in many new threads immediately coming into being, this time
> > > > with a higher proportion of threads BLOCKED on
> > > > org.apache.solr.search.function.FileFloatSource$CreationPlaceholder@37b782de
> > > > - See second screenshot. The stack trace for those is too big for one
> > > > screen so here's the output:
> > > >
> > > > qtp178604517-861 (861)
> > > >
> > > > org.apache.solr.search.function.FileFloatSource$CreationPlaceholder@37b782de
> > > >
> > > > - org.apache.solr.search.function.FileFloatSource$Cache.get(FileFloatSource.java:210)
> > > > - org.apache.solr.search.function.FileFloatSource.getCachedFloats(FileFloatSource.java:158)
> > > > - org.apache.solr.search.function.FileFloatSource.getValues(FileFloatSource.java:97)
> > > > - org.apache.lucene.queries.function.ValueSource$WrappedDoubleValuesSource.getValues(ValueSource.java:203)
> > > > - org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource.getValues(FunctionScoreQuery.java:261)
> > > > - org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight.scorer(FunctionScoreQuery.java:224)
> > > > - org.apache.lucene.search.Weight.scorerSupplier(Weight.java:148)
> > > > - org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
> > > > - org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
> > > > - org.apache.lucene.search.Weight.bulkScorer(Weight.java:182)
> > > > - org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:338)
> > > > - org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:656)
> > > > - org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
> > > > - org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:211)
> > > > - org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1705)
> > > > - org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1408)
> > > > - org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:596)
> > > > - org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1500)
> > > > - org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:390)
> > > > - org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:369)
> > > > - org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)
> > > > - org.apache.solr.core.SolrCore.execute(SolrCore.java:2637)
> > > > - org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:794)
> > > > - org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:567)
> > > > - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
> > > > - org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357)
> > > > - org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)
> > > > - org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
> > > > - org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
> > > > - org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > > > - org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
> > > > - org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> > > > - org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
> > > > - org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
> > > > - org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
> > > > - org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
> > > > - org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
> > > > - org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
> > > > - org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
> > > > - org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
> > > > - org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
> > > > - org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > > > - org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
> > > > - org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
> > > > - org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
> > > > - org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> > > > - org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
> > > > - org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> > > > - org.eclipse.jetty.server.Server.handle(Server.java:516)
> > > > - org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
> > > > - org.eclipse.jetty.server.HttpChannel$$Lambda$556/0x000000080067a440.dispatch(Unknown Source)
> > > > - org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
> > > > - org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
> > > > - org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
> > > > - org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
> > > > - org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
> > > > - org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
> > > > - org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
> > > > - org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
> > > > - org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
> > > > - org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
> > > > - org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
> > > > - org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
> > > > - org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
> > > > - java.base@11.0.5/java.lang.Thread.run(Thread.java:834)
> > > >
> > > > [image: image.png]
> > > > [image: image.png]
> > > >
> > > > On Wed, 13 Oct 2021 at 00:03, Joel Bernstein <joels...@gmail.com> wrote:
> > > >
> > > >> There is a thread dump on the Solr admin. You can use that to determine
> > > >> what all those threads are doing and where they are getting stuck. You can
> > > >> post parts of the thread dump back to this email thread as well.
> > > >>
> > > >>
> > > >>
> > > >> Joel Bernstein
> > > >> http://joelsolr.blogspot.com/
> > > >>
> > > >>
> > > >> On Tue, Oct 12, 2021 at 11:15 AM Dominic Humphries
> > > >> <domi...@adzuna.com.invalid> wrote:
> > > >>
> > > >> > We run 8.3.1 in prod without any problems, but we're having issues with
> > > >> > trying to upgrade.
> > > >> >
> > > >> > I've created an 8.9.0 leader & follower, imported our live data into it,
> > > >> > and am testing it via replaying requests made to prod. We're seeing a big
> > > >> > problem where fairly moderate request rates are causing the instance to
> > > >> > become so slow it fails healthcheck. The logs showed a lot of errors around
> > > >> > creating threads:
> > > >> >
> > > >> > solr[4507]: [124136.511s][warning][os,thread] Failed to start thread -
> > > >> > pthread_create failed (EAGAIN) for attributes: stacksize: 256k, guardsize:
> > > >> > 0k, detached.
> > > >> >
> > > >> > WARN (qtp178604517-3891) [ ] o.e.j.i.ManagedSelector =>
> > > >> > java.lang.OutOfMemoryError: unable to create native thread: possibly out of
> > > >> > memory or process/resource limits reached
> > > >> >
> > > >> > So I monitored thread count for the process whilst running the test suite
> > > >> > and saw a persistent pattern: Threads increased until maxed out, the logs
> > > >> > flooded with errors as it tried to create still more threads, and the
> > > >> > instance slowed down until terminated as unhealthy.
> > > >> >
> > > >> > The DefaultTasksMax is set to 4915, I've tried raising and lowering it but
> > > >> > regardless of value the result is the same: it gets maxed and everything
> > > >> > slows down.
> > > >> >
> > > >> > Is there anything I can do to stop solr spinning up so many threads it
> > > >> > ceases to function? There have been a few test passes where it
> > > >> > spontaneously dropped threadcount from thousands to hundreds and stayed up
> > > >> > longer, but there seems no pattern to when this happens. Running the same
> > > >> > tests on 8.3.1 results in a much slower increase in threads and it never
> > > >> > quite maxes them so things continue to function.
> > > >> >
> > > >> > See below for the thread count and healthcheck times seen on a (fairly
> > > >> > harsh) test run of 100 requests/sec
> > > >> >
> > > >> > Thanks
> > > >> >
> > > >> > Dominic
> > > >> >
> > > >> >
> > > >> > Threadcount:
> > > >> >
> > > >> > ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; ps -eLF | grep 'start.jar' | wc -l; sleep 10s; done
> > > >> > Tue Oct 12 14:27:33 UTC 2021
> > > >> > 52
> > > >> > Tue Oct 12 14:27:43 UTC 2021
> > > >> > 52
> > > >> > Tue Oct 12 14:27:54 UTC 2021
> > > >> > 52
> > > >> > Tue Oct 12 14:28:04 UTC 2021
> > > >> > 52
> > > >> > Tue Oct 12 14:28:14 UTC 2021
> > > >> > 569
> > > >> > Tue Oct 12 14:28:24 UTC 2021
> > > >> > 899
> > > >> > Tue Oct 12 14:28:34 UTC 2021
> > > >> > 1198
> > > >> > Tue Oct 12 14:28:44 UTC 2021
> > > >> > 1589
> > > >> > Tue Oct 12 14:28:54 UTC 2021
> > > >> > 2016
> > > >> > Tue Oct 12 14:29:05 UTC 2021
> > > >> > 2451
> > > >> > Tue Oct 12 14:29:15 UTC 2021
> > > >> > 2851
> > > >> > Tue Oct 12 14:29:26 UTC 2021
> > > >> > 2934
> > > >> > Tue Oct 12 14:29:36 UTC 2021
> > > >> > 3249
> > > >> > Tue Oct 12 14:29:46 UTC 2021
> > > >> > 3501
> > > >> > Tue Oct 12 14:29:57 UTC 2021
> > > >> > 3734
> > > >> > Tue Oct 12 14:30:07 UTC 2021
> > > >> > 4128
> > > >> > Tue Oct 12 14:30:18 UTC 2021
> > > >> > 4374
> > > >> > Tue Oct 12 14:30:29 UTC 2021
> > > >> > 4637
> > > >> > Tue Oct 12 14:30:39 UTC 2021
> > > >> > 4693
> > > >> > Tue Oct 12 14:30:50 UTC 2021
> > > >> > 4807
> > > >> > Tue Oct 12 14:31:01 UTC 2021
> > > >> > 4916
> > > >> > Tue Oct 12 14:31:11 UTC 2021
> > > >> > 4916
> > > >> > Tue Oct 12 14:31:22 UTC 2021
> > > >> > Connection to 10.40.22.166 closed by remote host.
> > > >> >
> > > >> >
> > > >> > Healthcheck:
> > > >> >
> > > >> > ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; curl -v localhost:8983/solr/ 2>&1 | grep HTTP; date; echo '----'; sleep 10s; done
> > > >> > Tue Oct 12 14:27:34 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:27:34 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:27:44 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:27:44 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:27:54 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:27:54 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:28:04 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:28:04 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:28:14 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0< HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:28:16 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:28:26 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:12 --:--:-- 0< HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:28:39 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:28:49 UTC 2021
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0> GET /solr/ HTTP/1.1
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:23 --:--:-- 0< HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:29:13 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:29:23 UTC 2021
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0> GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:29:25 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:29:35 UTC 2021
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0> GET /solr/ HTTP/1.1
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:09 --:--:-- 0< HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:29:44 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:29:54 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:11 --:--:-- 0< HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:30:06 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:30:16 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0< HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:30:20 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:30:30 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0< HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:30:33 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:30:43 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:30:43 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:30:53 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > Tue Oct 12 14:30:55 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:31:05 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:31:05 UTC 2021
> > > >> > ----
> > > >> > Tue Oct 12 14:31:15 UTC 2021
> > > >> > > GET /solr/ HTTP/1.1
> > > >> > < HTTP/1.1 200 OK
> > > >> > Tue Oct 12 14:31:15 UTC 2021
> > > >> > ----
> > > >> > Connection to 10.40.22.166 closed by remote host.
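
The thread-count loop in the original message only shows how many threads exist, while the later replies turn on which state they are in (TIMED_WAITING versus BLOCKED). As a rough sketch, not something taken from any message in this thread, the script below tallies thread states from a text thread dump. It assumes the dump is in the format printed by the JDK's `jstack` tool, where each thread has a `java.lang.Thread.State: <STATE>` line; the function name `count_thread_states` and the file name used in the usage note are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches the per-thread state line that jstack prints, e.g.
#    java.lang.Thread.State: TIMED_WAITING (parking)
# Assumption: the dump being parsed uses this jstack text format.
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def count_thread_states(dump_text):
    """Return a Counter mapping thread state name -> number of threads."""
    return Counter(STATE_RE.findall(dump_text))


if __name__ == "__main__":
    # Read a thread dump from stdin and print the states, most common first.
    counts = count_thread_states(sys.stdin.read())
    for state, n in counts.most_common():
        print(state, n)
```

Saved under a hypothetical name such as `thread_states.py`, it could be run as `jstack <pid> | python3 thread_states.py` inside the same 10-second loop, which would show whether the TIMED_WAITING count falls once the test stops.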