[jira] [Created] (KUDU-3471) Enforce flushing of tablet-meta

Abhishek Chennaka (Jira) Fri, 14 Apr 2023 15:56:10 -0700

Abhishek Chennaka created KUDU-3471:
---------------------------------------


             Summary: Enforce flushing of tablet-meta
                 Key: KUDU-3471
                 URL: https://issues.apache.org/jira/browse/KUDU-3471
             Project: Kudu
          Issue Type: Bug
            Reporter: Abhishek Chennaka


We suspect that tablet-meta was not updated, which led to a tablet not being able 
to start up. Below is the log analysis:

1. The cluster was restarted on Dec 2, and the tablet 
5d3f10a0427745c7abdc889dae6f62b0 bootstrapped successfully. The last known 
committed index was logged as 1455:
{code:java}
Last known committed idx: 1455
{code}
2. The WAL segment containing ops 1411-1455 was GC'd, which indicates that the 
data from these ops had been persisted to the data disk of the server.
{code:java}
I1202 12:15:03.523751  9026 log.cc:1068] T 5d3f10a0427745c7abdc889dae6f62b0 P 
b395541607c54801955a6b5ed310e67c: Deleting log segment in path: 
/data/kudu/0/wals/5d3f10a0427745c7abdc889dae6f62b0/wal-000000001 (ops 
1411-1455).
{code}
There were also flushes between Dec 2 and Jan 27:
{code:java}
I0116 11:40:33.197038  9463 maintenance_manager.cc:382] P 
b395541607c54801955a6b5ed310e67c: Scheduling 
FlushMRSOp(5d3f10a0427745c7abdc889dae6f62b0): perf score=1.000000
I0116 11:42:33.589488  9463 maintenance_manager.cc:382] P 
b395541607c54801955a6b5ed310e67c: Scheduling 
FlushMRSOp(5d3f10a0427745c7abdc889dae6f62b0): perf score=0.033403
I0125 15:26:57.443202  9463 maintenance_manager.cc:382] P 
b395541607c54801955a6b5ed310e67c: Scheduling 
FlushMRSOp(5d3f10a0427745c7abdc889dae6f62b0): perf score=1.000000
I0125 15:28:57.865049  9463 maintenance_manager.cc:382] P 
b395541607c54801955a6b5ed310e67c: Scheduling 
FlushMRSOp(5d3f10a0427745c7abdc889dae6f62b0): perf score=0.033412
{code}
3. As part of Tablet::DoMergeCompactionOrFlush(), we update the TabletMetadata 
during every flush.
[https://github.com/apache/kudu/blob/master/src/kudu/tablet/tablet.cc#L2205]
In other words, there were multiple attempts to update the tablet metadata 
after the WAL segment was GC'd on Dec 2.

4. When the tablet server was restarted on Jan 27, tablet bootstrap consulted 
the tablet metadata for the id of the last rowset flushed to the data disk 
(last_durable_mrs_id) when replaying the WAL segments. This id seems to refer 
to an MRS for ops with index less than 1455, which should already have been 
flushed and should not need to be replayed. Since the WAL segment had been 
GC'd, the tablet ended up in the stopped state.
{code:java}
CommitMsg was orphaned but it referred to stores which need replay. Commit: 
op_type: WRITE_OP commited_op_id { term: 17 index: 1455 }
{code}
[https://github.com/apache/kudu/blob/master/src/kudu/tablet/tablet_bootstrap.cc#L1072]
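
The failure mode in step 4 can be illustrated with a simplified model. This is 
only a sketch and not Kudu's actual bootstrap code: CommitRecord, 
refers_to_unflushed_store, check_orphaned_commit and the MRS ids used are all 
hypothetical, and the sketch assumes that a store counts as flushed when its 
MRS id is less than or equal to last_durable_mrs_id.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CommitRecord:
    """Hypothetical stand-in for a commit message found in the WAL."""
    op_index: int
    target_mrs_ids: Tuple[int, ...]  # MemRowSets the op wrote to


def refers_to_unflushed_store(commit: CommitRecord, last_durable_mrs_id: int) -> bool:
    # Assumption: an MRS with id <= last_durable_mrs_id is already on disk.
    return any(mrs_id > last_durable_mrs_id for mrs_id in commit.target_mrs_ids)


def check_orphaned_commit(commit: CommitRecord, last_durable_mrs_id: int) -> None:
    """Model a commit whose matching op is gone because its WAL segment was GC'd."""
    if refers_to_unflushed_store(commit, last_durable_mrs_id):
        # The op would have to be replayed, but it is no longer in the WAL.
        raise RuntimeError(
            f"orphaned commit at index {commit.op_index} refers to stores "
            "which need replay"
        )
```

In this model, if the flushes did not make it into the metadata, 
last_durable_mrs_id stays stale, the check raises for the commit at index 
1455, and bootstrap cannot proceed. With up-to-date metadata the same commit 
is skipped.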

Unfortunately, the tablet-meta of the affected tablets could not be collected, 
but the only possible explanation for the above is that the metadata of the 
tablet was not updated.

Forcing an fsync of tablet-meta files, similar to what is done for cmeta, 
should help prevent such scenarios.
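
As a rough illustration of what forced fsyncing could look like, the sketch 
below shows a generic POSIX pattern for replacing a file durably: write to a 
temporary file, fsync it, rename it over the target, then fsync the parent 
directory. This is not Kudu's implementation (Kudu is written in C++), and 
write_file_durably is a hypothetical helper name.

```python
import os
import tempfile


def write_file_durably(path: str, data: bytes) -> None:
    """Replace `path` with `data` and force both file and rename to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force file contents to disk
        os.replace(tmp_path, path)  # atomically swap in the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash
    # (assumes a POSIX system where directories can be opened and fsynced).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping either fsync leaves a window in which a crash could lose the new 
contents or the rename, which is the kind of gap this issue is concerned with.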



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
