Tried enabling -Dsolr.http1=true but it didn't help. Still seeing the timeout
after 180s (even without sending any traffic to the cluster) and also noticed

Caused by: java.util.concurrent.TimeoutException: Total timeout 600000 ms elapsed

on some of the nodes (stacktrace here: https://justpaste.it/29bpv).
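
(For completeness: the property just needs to reach the Solr JVM options. A
minimal sketch, assuming a solr.in.sh-style setup; the Kubernetes equivalent is
whatever mechanism the deployment uses to append to SOLR_OPTS:

SOLR_OPTS="$SOLR_OPTS -Dsolr.http1=true"
)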
We're also seeing errors related to:

o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory:
MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata
(we do not use snapshots at all) (stacktrace: https://justpaste.it/88en6)

CoreIsClosedException o.a.s.u.CommitTracker auto commit error...:
https://justpaste.it/bbbms

org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error
from server at null (this node is a leader): https://justpaste.it/5nq7b
From time to time we're also observing the following in the logs on multiple
nodes (TLOG replicas across the board):
WARN (indexFetcher-120-thread-1) [] o.a.s.h.IndexFetcher File _8ux.cfe did not
match. expected checksum is 3843994300 and actual is checksum 2148229542.
expected length is 542 and actual length is 542
> On 5. Dec 2022, at 5:12 PM, Houston Putman <[email protected]> wrote:
>
> I'm not sure this is the issue, but maybe it's http2 vs http1.
>
> Could you retry with the following set on the cluster?
>
> -Dsolr.http1=true
>
>
>
> On Mon, Dec 5, 2022 at 5:08 AM Nick Vladiceanu <[email protected]> wrote:
>
>> Hello folks,
>>
>> We’re running our SolrCloud cluster in Kubernetes. Recently we’ve upgraded
>> from 8.11 to 9.0 (and eventually to 9.1).
>>
>> We fully reindexed the collections after the upgrade; all looked good, no
>> errors, and we even noticed response-time improvements.
>>
>> We have the following specs:
>> collection size:
>> 22M docs, ~1.3KB avg doc size; ~28GB total collection size at this point;
>> shards: 6 shards, each ~4.7GB; 1 core per node;
>> nodes:
>> 96 nodes,
>> 30Gi of RAM,
>> 16 cores each
>> Heap: 23GB
>> JavaOpts: -Dsolr.modules=scripting,analysis-extras,ltr
>> gcTune: -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300
>> -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages
>> -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=10 -XX:ConcGCThreads=2
>> -XX:MinHeapFreeRatio=2 -XX:MaxHeapFreeRatio=10
>>
>>
>> Problem
>>
>> The problem we face is with reloading the collection: in sync mode the
>> request times out, and in async mode the task runs forever:
>>
>> curl "reload" output: https://justpaste.it/ap4d2
>> ErrorReportingConcurrentUpdateSolrClient stacktrace (appears in the logs
>> of some nodes): https://justpaste.it/aq3dw
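>>
>> For reference, the reload is issued roughly like this (host, port and the
>> async id are placeholders for our actual values):
>>
>> # sync: times out after 180 seconds
>> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection"
>> # async: the task never completes
>> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection&async=reload-1"
>> curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=reload-1"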
>>
>> There are no issues on a newly created cluster if there is no incoming
>> traffic to it. Once we start sending requests to the cluster, collection
>> reload becomes impossible. Other (smaller) collections within the same
>> cluster reload just fine.
>>
>> In some cases the old-generation GC kicks in on some node and makes the
>> entire cluster unstable; however, that doesn't happen every time a
>> collection reload times out.
>>
>> We've tried rolling back to 8.11 and everything works normally again, as it
>> used to: no errors with reload, no other errors in the logs during reload,
>> etc.
>>
>> We tried the following:
>> run 9.0 and 9.1 on Java 11 and Java 17: same result;
>> lower cache warming, disable firstSearcher queries: same result;
>> increase heap size, tune GC: same result;
>> use API v1 and API v2 to issue the reload commands (see the sketch after
>> this list): no difference;
>> sync vs async reload: either a forever-running task or a timeout after 180
>> seconds;
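>>
>> A sketch of the two API forms (the v2 path is from memory and may differ
>> slightly between 9.x releases, so treat it as approximate):
>>
>> # v1
>> curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection"
>> # v2 (approximate)
>> curl -X POST -H 'Content-Type: application/json' \
>>      -d '{"reload": {}}' "http://localhost:8983/api/collections/my_collection"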
>>
>> Did anyone face similar issues after upgrading to Solr 9? Could you please
>> advise where we should focus our attention while debugging this behavior?
>> Any other advice/suggestions?
>>
>> Thank you
>>
>>
>> Best regards,
>> Nick Vladiceanu