[jira] [Commented] (SOLR-14679) TLOGs grow forever, never get out of BUFFERING state

Jira Fri, 25 Mar 2022 06:59:05 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512394#comment-17512394
 ]


Viktor Molnár commented on SOLR-14679:
--------------------------------------

Well, I thought we want to commit the information in the TLOG, since the code 
is executed only if replica type is TLOG. And TLOG replica should keep it's own 
transaction log.

There are also other places in the code, where startReplicationFromLeader() is 
called with TRUE as the second argument and it's also related to TLOG replica 
type:
{code:java}
//RecoveryStrategy.java:
if (successfulRecovery) {
    log.info("Registering as Active after recovery.");
    try {
        if (replicaType == Replica.Type.TLOG) {
            zkController.startReplicationFromLeader(coreName, true);
        }
     ...


//ZkController.java
if (!didRecovery) {
    if (isTlogReplicaAndNotLeader) {
        startReplicationFromLeader(coreName, true);
    }
    ...


//CoreContainer.java
if (replica.getType() == Replica.Type.TLOG) {
    getZkController().stopReplicationFromLeader(core.getName());
    if (!cd.getCloudDescriptor().isLeader()) {
        getZkController().startReplicationFromLeader(newCore.getName(), true);
    }
    ...{code}
 

All other places, where second argument is FALSE are related to PULL replica 
type or recovery when transaction log is not required.

 

Could you please re-check it [~houston] ? Thank you

> TLOGs grow forever, never get out of BUFFERING state
> ----------------------------------------------------
>
>                 Key: SOLR-14679
>                 URL: https://issues.apache.org/jira/browse/SOLR-14679
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Priority: Major
>
> From the user's list, 
>  (Gael Jourdan-Weil)
> https://www.mail-archive.com/solr-user@lucene.apache.org/msg151867.html
> I think I've come down to the root cause of this mess in our case.
> Everything is confirming that the TLOG state is "BUFFERING" rather than 
> "ACTIVE".
> 1/ This can be seen with the metrics API as well where we observe:
> "TLOG.replay.remaining.bytes":48997506,
> "TLOG.replay.remaining.logs":1,
> "TLOG.state":1,
> 2/ When a hard commit occurs, we can see it in the logs and as the index 
> files are updated ; but we can also see that postCommit and preCommit 
> UpdateLog methods are called but exits immediately which looking at the code 
> indicates the state is "BUFFERING".
> So, why is this TLOG still in "BUFFERING" state?
> From the code, the only place where state is set to "BUFFERING" seems to be 
> UpdateLog.bufferUpdates.
> From the logs, in our case it comes from recovery process. We see the message 
> "Begin buffering updates. core=[col_blue_shard1]".
> Just after we can see "Publishing state of core [col_blue_shard1] as 
> recovering, leader is [http://srv2/solr/col_blue_shard1/] and I am 
> [http://srv1/solr/col_blue_shard1/]";.
> Until here, everything is expected I guess but why the TLOG state is not set 
> to "ACTIVE" a bit later?
> Well, the "Begin buffering updates" occured and 500ms later we can see:
> - "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we 
> shut down srv2, this is our main cause of problem)
> - "I am going to be the leader srv1"
> - "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]"
> And 2s later:
> - "Attempting to PeerSync from [http://srv2/solr/es_blue_shard1/] - 
> recoveringAfterStartup=[true]"
> - "Error while trying to recover. 
> core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get 
> fingerprint from leader"
> - "Finished recovery process, successful=[false]"
> At this point, I think the root cause on our side is a rolling update that we 
> did too quickly: we stopped node2 while node1 while recovering from it.
> It's still not clear how everything went back to "active" state after such a 
> failed recovery and a TLOG still in "BUFFERING".
> We shouldn't have been in recovery in the first place and I think we know 
> why, this is a first thing that we have adressed.
> Then we need to add some pauses in our rolling update strategy.
> Does it makes sense? Can you think of something else to check/improve?
> Best Regards,
> Gaël



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-14679) TLOGs grow forever, never get out of BUFFERING state

Reply via email to