[
https://issues.apache.org/jira/browse/SOLR-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18031964#comment-18031964
]
Gaël Jourdan commented on SOLR-14679:
-------------------------------------
Hello folks,
We hadn't seen the issue again until recently, but it seems to be coming back
from time to time.
Solr version: 9.9.0 386ed096a1946c488cfe576a19a147bdb1153508
Here's what I observe:
{code:bash}
ll -h .../collection_shard1_replica_n4/data/tlog/
total 11G
-rw-r--r-- 1 xxx xxx 85M Oct 16 04:05 tlog.0000000000000000119
-rw-r--r-- 1 xxx xxx 11G Oct 22 04:05 tlog.0000000000000000120 {code}
That is a gigantic TLOG file (11G).
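(As a side note, a rough scan like the one below is enough to spot this on other replicas; the data-home path is just a placeholder, not our actual layout.)
{code:bash}
# List tlog files larger than 1 GiB under every replica's data directory
# (SOLR_DATA_HOME is a placeholder; adjust to your installation layout).
find "${SOLR_DATA_HOME:-/var/solr/data}" -path '*/data/tlog/tlog.*' -size +1G \
  -printf '%s bytes\t%p\n' | sort -nr

# Track overall tlog directory growth over time.
du -sh "${SOLR_DATA_HOME:-/var/solr/data}"/*/data/tlog
{code}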
And if I look at the Admin Metrics, I can see, like in the original issue, that
the TLOG is still in BUFFERING state (state=1):
{code:json}
"TLOG.applyingBuffered.ops":{
  "count":0,
  "meanRate":0.0,
  "1minRate":0.0,
  "5minRate":0.0,
  "15minRate":0.0
},
"TLOG.buffered.ops":0,
"TLOG.copyOverOldUpdates.ops":{
  "count":0,
  "meanRate":0.0,
  "1minRate":0.0,
  "5minRate":0.0,
  "15minRate":0.0
},
"TLOG.replay.ops":{
  "count":0,
  "meanRate":0.0,
  "1minRate":0.0,
  "5minRate":0.0,
  "15minRate":0.0
},
"TLOG.replay.remaining.bytes":88853809,
"TLOG.replay.remaining.logs":1,
"TLOG.state":1,
{code}
Is there a chance that the fix got reverted, or is there maybe another situation
in which this can happen?
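(In case it helps to correlate, the buffering/recovery messages for the affected core can be pulled out of solr.log roughly like this; the log path is a placeholder.)
{code:bash}
# Correlate "Begin buffering updates" with the outcome of the recovery that
# triggered it (log path is a placeholder; core name is from the tlog path above).
grep -E 'Begin buffering updates|Stopping recovery for|Finished recovery process' \
  /var/solr/logs/solr.log | grep 'collection_shard1_replica_n4'
{code}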
> TLOGs grow forever, never get out of BUFFERING state
> ----------------------------------------------------
>
> Key: SOLR-14679
> URL: https://issues.apache.org/jira/browse/SOLR-14679
> Project: Solr
> Issue Type: Bug
> Reporter: Erick Erickson
> Assignee: Houston Putman
> Priority: Critical
> Fix For: 9.1, 8.11.3, 10.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> From the user's list,
> (Gael Jourdan-Weil)
> https://www.mail-archive.com/[email protected]/msg151867.html
> I think I've come down to the root cause of this mess in our case.
> Everything confirms that the TLOG state is "BUFFERING" rather than
> "ACTIVE".
> 1/ This can be seen with the metrics API as well where we observe:
> "TLOG.replay.remaining.bytes":48997506,
> "TLOG.replay.remaining.logs":1,
> "TLOG.state":1,
> 2/ When a hard commit occurs, we can see it in the logs and the index files
> are updated; but we can also see that the postCommit and preCommit
> UpdateLog methods are called but exit immediately, which, looking at the code,
> indicates the state is "BUFFERING".
> So, why is this TLOG still in "BUFFERING" state?
> From the code, the only place where state is set to "BUFFERING" seems to be
> UpdateLog.bufferUpdates.
> From the logs, in our case it comes from recovery process. We see the message
> "Begin buffering updates. core=[col_blue_shard1]".
> Just after we can see "Publishing state of core [col_blue_shard1] as
> recovering, leader is [http://srv2/solr/col_blue_shard1/] and I am
> [http://srv1/solr/col_blue_shard1/]".
> Up to here, everything is expected, I guess, but why is the TLOG state not set
> to "ACTIVE" a bit later?
> Well, the "Begin buffering updates" occurred and 500ms later we can see:
> - "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we
> shut down srv2, this is our main cause of problem)
> - "I am going to be the leader srv1"
> - "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]"
> And 2s later:
> - "Attempting to PeerSync from [http://srv2/solr/es_blue_shard1/] -
> recoveringAfterStartup=[true]"
> - "Error while trying to recover.
> core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get
> fingerprint from leader"
> - "Finished recovery process, successful=[false]"
> At this point, I think the root cause on our side is a rolling update that we
> did too quickly: we stopped node2 while node1 was recovering from it.
> It's still not clear how everything went back to the "active" state after such
> a failed recovery with a TLOG still in "BUFFERING".
> We shouldn't have been in recovery in the first place, and I think we know
> why; this is the first thing that we have addressed.
> Then we need to add some pauses in our rolling update strategy.
> Does it make sense? Can you think of something else to check/improve?
> Best Regards,
> Gaël