Nilesh Singh created SOLR-17788:
-----------------------------------

             Summary: Cores Hang During Full Index Bootstrap in Leader/Replica 
Mode 
                 Key: SOLR-17788
                 URL: https://issues.apache.org/jira/browse/SOLR-17788
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: replication (java)
    Affects Versions: 9.8.1
         Environment: Kubernetes (K8s), Solr in Leader/Replica Mode, Custom 
Solr Application
            Reporter: Nilesh Singh


We've encountered a critical edge case bug in our Solr deployment when 
operating in leader/replica mode. Our architecture includes a two-layer Solr 
setup:
 * *Indexer Layer:* Responsible for creating the index.
 * *Repeater Layer:* Receives the created index and serves it to downstream 
consumers.

Our custom Solr application, which runs multiple cores, polls the repeater 
layer for replication. Throughout the day, the indexer performs incremental 
indexing jobs. However, once daily, a bootstrap job runs to create a full index.

*Issue Observed:*

During the full bootstrap process, all cores in our custom Solr application 
enter a hung state while attempting to replicate from the repeater layer. This 
replication hang persists indefinitely and requires a manual restart of the 
application pods to recover.

*Key Details:*
 * Index optimization is *not* performed on the leader.
 * The issue only occurs during {*}full index bootstrap{*}, not during 
incremental indexing.
 * The replication hang affects *all cores* across multiple instances of the 
custom Solr application.
 * The system is deployed in {*}Kubernetes{*}, with:
 ** 1 instance of the indexer
 ** Multiple instances of the repeater
 ** A large number of custom Solr application instances

*Impact:*

This bug causes a complete halt in data replication during full indexing 
intermittently. random number of pods go into this mode, leading to stale or 
missing data in the custom Solr application until manual intervention is 
performed.

*Steps to Reproduce:*
 # Run Solr in leader/replica mode with the described architecture.
 # Perform incremental indexing throughout the day — replication works as 
expected.
 # Trigger a full index bootstrap job. We have large data set hence replication 
from indexer to repeater take 3 to 5 mins.
 # Observe that all custom Solr cores hang in replication mode.

*Expected Behavior:*

Cores should successfully replicate the full index from the repeater layer 
without requiring a restart.

*Actual Behavior:*

Cores hang indefinitely during replication and require a restart to resume 
normal operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to