I'm not sure this is the issue, but maybe it's HTTP/2 vs. HTTP/1. Could you retry with the following set on the cluster?
-Dsolr.http1=true

On Mon, Dec 5, 2022 at 5:08 AM Nick Vladiceanu <vladicean...@gmail.com> wrote:

> Hello folks,
>
> We’re running our SolrCloud cluster in Kubernetes. Recently we’ve upgraded
> from 8.11 to 9.0 (and eventually to 9.1).
>
> Fully reindexed collections after upgrade, all looking good, no errors,
> response time improvements are noticed.
>
> We have the following specs:
> collection size: 22M docs, 1.3Kb doc size; ~28Gb total collection size at
> this point;
> shards: 6 shards, each ~4,7Gb; 1 core per node;
> nodes:
>   30Gi of RAM,
>   16 cores
>   96 nodes
> Heap: 23Gb heap
> JavaOpts: -Dsolr.modules=scripting,analysis-extras,ltr”
> gcTune: -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300
> -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages
> -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=10 -XX:ConcGCThreads=2
> -XX:MinHeapFreeRatio=2 -XX:MaxHeapFreeRatio=10
>
>
> Problem
>
> The problem we face is when we try to reload the collection, in sync mode
> we’re getting timed out or forever running task if reload executed in async
> mode:
>
> curl “reload” output: https://justpaste.it/ap4d2
> ErrorReportingConcurrentUpdateSolrClient stacktrace (appears in the logs
> of some nodes): https://justpaste.it/aq3dw
>
> There are no issues on a newly created cluster if there is no incoming
> traffic to it. Once we start sending requests to the cluster, collection
> reload becomes impossible. Other collections (smaller) within the same
> cluster are reloading just fine.
>
> In some cases, on some node the Old generation GC is kicking in and makes
> the entire cluster unstable, however, that doesn’t all the time when
> collection reload is timing out.
>
> We’ve tried the rollback to 8.11 and everything works normally as it used
> to be, no errors with reload, no other errors in the logs during reload,
> etc.
>
> We tried the following:
> run 9.0, 9.1 on Java 11 and Java 17: same result;
> lower cache warming, disable firstSearcher queries: same result;
> increase heap size, tune gc: same result;
> use apiv1 and apiv2 to issue reload commands: no difference;
> sync vs async reload: either forever running task or timing out after 180
> seconds;
>
> Did anyone face similar issues after upgrading to version 9 of Solr? Could
> you please advice where should we focus our attention while debugging this
> behavior? Any other advices/suggestions?
>
> Thank you
>
>
> Best regards,
> Nick Vladiceanu
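
When retrying with the flag set, it may help to run the async reload the same way each time, so that results with and without it can be compared. Below is a minimal sketch in Python of submitting a Collections API RELOAD with an async id and polling REQUESTSTATUS until it finishes or a deadline passes. The base URL, collection name, request id and timing values are placeholders (assumptions, not taken from the original message), and the helper names are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: placeholder base URL; point this at one node of your cluster.
SOLR_BASE = "http://localhost:8983/solr"


def collections_api_url(base, params):
    """Build a Collections API URL from a dict of query parameters."""
    return base + "/admin/collections?" + urllib.parse.urlencode(params)


def get_json(url, timeout=30):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def reload_and_wait(base, collection, request_id, max_wait=600, poll_every=5):
    """Submit an async RELOAD, then poll REQUESTSTATUS until it settles.

    Returns the final state reported by Solr ("completed", "failed",
    "notfound"), or "timeout" if max_wait seconds pass first.
    """
    # "async" is a Python keyword, so parameters are passed as a dict.
    get_json(collections_api_url(base, {
        "action": "RELOAD",
        "name": collection,
        "async": request_id,
        "wt": "json",
    }))
    status_url = collections_api_url(base, {
        "action": "REQUESTSTATUS",
        "requestid": request_id,
        "wt": "json",
    })
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        state = get_json(status_url)["status"]["state"]
        if state in ("completed", "failed", "notfound"):
            return state
        time.sleep(poll_every)  # still "submitted" or "running"
    return "timeout"


if __name__ == "__main__":
    # Hypothetical collection name and request id, for illustration only.
    print(reload_and_wait(SOLR_BASE, "mycollection", "reload-test-1"))
```

Use a fresh request id for every run, since Solr keeps the status of earlier async requests under their ids. A run that ends in "timeout" here corresponds to the "forever running task" case described above.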