Nilesh Singh created SOLR-17788:
-----------------------------------
Summary: Cores Hang During Full Index Bootstrap in Leader/Replica
Mode
Key: SOLR-17788
URL: https://issues.apache.org/jira/browse/SOLR-17788
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: replication (java)
Affects Versions: 9.8.1
Environment: Kubernetes (K8s), Solr in Leader/Replica Mode, Custom
Solr Application
Reporter: Nilesh Singh
We've encountered a critical edge case bug in our Solr deployment when
operating in leader/replica mode. Our architecture includes a two-layer Solr
setup:
* *Indexer Layer:* Responsible for creating the index.
* *Repeater Layer:* Receives the created index and serves it to downstream
consumers.
Our custom Solr application, which runs multiple cores, polls the repeater
layer for replication. Throughout the day, the indexer performs incremental
indexing jobs. However, once daily, a bootstrap job runs to create a full index.
*Issue Observed:*
During the full bootstrap process, all cores in our custom Solr application
enter a hung state while attempting to replicate from the repeater layer. This
replication hang persists indefinitely and requires a manual restart of the
application pods to recover.
*Key Details:*
* Index optimization is *not* performed on the leader.
* The issue only occurs during {*}full index bootstrap{*}, not during
incremental indexing.
* The replication hang affects *all cores* across multiple instances of the
custom Solr application.
* The system is deployed in {*}Kubernetes{*}, with:
** 1 instance of the indexer
** Multiple instances of the repeater
** A large number of custom Solr application instances
*Impact:*
This bug causes a complete halt in data replication during full indexing
intermittently. random number of pods go into this mode, leading to stale or
missing data in the custom Solr application until manual intervention is
performed.
*Steps to Reproduce:*
# Run Solr in leader/replica mode with the described architecture.
# Perform incremental indexing throughout the day — replication works as
expected.
# Trigger a full index bootstrap job. We have large data set hence replication
from indexer to repeater take 3 to 5 mins.
# Observe that all custom Solr cores hang in replication mode.
*Expected Behavior:*
Cores should successfully replicate the full index from the repeater layer
without requiring a restart.
*Actual Behavior:*
Cores hang indefinitely during replication and require a restart to resume
normal operation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]