[ https://issues.apache.org/jira/browse/SOLR-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622992#comment-17622992 ]
Gaël Jourdan edited comment on SOLR-14679 at 10/24/22 8:03 PM: --------------------------------------------------------------- FYI we're still having the issue on a regular basis (Solr 8.11.1) but we only have NRT replicas. was (Author: gjourdan): FYI we're still having the issue on a regular basis (Solr 8.11.1). > TLOGs grow forever, never get out of BUFFERING state > ---------------------------------------------------- > > Key: SOLR-14679 > URL: https://issues.apache.org/jira/browse/SOLR-14679 > Project: Solr > Issue Type: Improvement > Reporter: Erick Erickson > Priority: Major > > From the user's list, > (Gael Jourdan-Weil) > https://www.mail-archive.com/solr-user@lucene.apache.org/msg151867.html > I think I've come down to the root cause of this mess in our case. > Everything is confirming that the TLOG state is "BUFFERING" rather than > "ACTIVE". > 1/ This can be seen with the metrics API as well where we observe: > "TLOG.replay.remaining.bytes":48997506, > "TLOG.replay.remaining.logs":1, > "TLOG.state":1, > 2/ When a hard commit occurs, we can see it in the logs and as the index > files are updated ; but we can also see that postCommit and preCommit > UpdateLog methods are called but exits immediately which looking at the code > indicates the state is "BUFFERING". > So, why is this TLOG still in "BUFFERING" state? > From the code, the only place where state is set to "BUFFERING" seems to be > UpdateLog.bufferUpdates. > From the logs, in our case it comes from recovery process. We see the message > "Begin buffering updates. core=[col_blue_shard1]". > Just after we can see "Publishing state of core [col_blue_shard1] as > recovering, leader is [http://srv2/solr/col_blue_shard1/] and I am > [http://srv1/solr/col_blue_shard1/]". > Until here, everything is expected I guess but why the TLOG state is not set > to "ACTIVE" a bit later? > Well, the "Begin buffering updates" occured and 500ms later we can see: > - "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we > shut down srv2, this is our main cause of problem) > - "I am going to be the leader srv1" > - "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]" > And 2s later: > - "Attempting to PeerSync from [http://srv2/solr/es_blue_shard1/] - > recoveringAfterStartup=[true]" > - "Error while trying to recover. > core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get > fingerprint from leader" > - "Finished recovery process, successful=[false]" > At this point, I think the root cause on our side is a rolling update that we > did too quickly: we stopped node2 while node1 while recovering from it. > It's still not clear how everything went back to "active" state after such a > failed recovery and a TLOG still in "BUFFERING". > We shouldn't have been in recovery in the first place and I think we know > why, this is a first thing that we have adressed. > Then we need to add some pauses in our rolling update strategy. > Does it makes sense? Can you think of something else to check/improve? > Best Regards, > Gaël -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org