Francois Visconte created KAFKA-16890:
-----------------------------------------

             Summary: Failing to build aux state on broker failover
                 Key: KAFKA-16890
                 URL: https://issues.apache.org/jira/browse/KAFKA-16890
             Project: Kafka
          Issue Type: Bug
          Components: Tiered-Storage
    Affects Versions: 3.7.0, 3.7.1
            Reporter: Francois Visconte


We have clusters where we replace machines often, and they fall into a state where we 
keep getting "Error building remote log auxiliary state for loadtest_topic-22" 
and the partition stays under-replicated until the leader is manually 
restarted. 

Looking into a specific case, here is what we observed in the __remote_log_metadata 
topic:


{code:java}
 
partition: 29, offset: 183593, value: 
RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=GZeRTXLMSNe2BQjRXkg6hQ}, startOffset=10823, endOffset=11536, 
brokerId=10013, maxTimestampMs=1715774588597, eventTimestampMs=1715781657604, 
segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324, 
133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535}, 
segmentSizeInBytes=704895, customMetadata=Optional.empty, 
state=COPY_SEGMENT_STARTED}
partition: 29, offset: 183594, value: 
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, 
state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715781658183, brokerId=10013}
partition: 29, offset: 183669, value: 
RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=L1TYzx0lQkagRIF86Kp0QQ}, startOffset=10823, endOffset=11544, 
brokerId=10008, maxTimestampMs=1715781445270, eventTimestampMs=1715782717593, 
segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324, 
133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535, 140=11537, 
142=11543}, segmentSizeInBytes=713088, customMetadata=Optional.empty, 
state=COPY_SEGMENT_STARTED}
partition: 29, offset: 183670, value: 
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, 
state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715782718370, brokerId=10008}
partition: 29, offset: 186215, value: 
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, 
state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874617, brokerId=10008}
partition: 29, offset: 186216, value: 
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, 
state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874725, brokerId=10008}
partition: 29, offset: 186217, value: 
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, 
state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874729, brokerId=10008}
partition: 29, offset: 186218, value: 
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
 id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, 
state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874817, brokerId=10008}
{code}
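
For anyone who wants to reproduce this kind of dump, a minimal sketch of a consumer is below. It assumes the RemoteLogMetadataSerde from the kafka-storage module is available on the classpath to decode the record values; that class and its deserialize(byte[]) method are my assumption about the right entry point, nothing in this ticket depends on them.

{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.server.log.remote.metadata.storage.serialization.RemoteLogMetadataSerde;

public class RemoteLogMetadataDump {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Assumed helper from the kafka-storage module for decoding __remote_log_metadata values.
        RemoteLogMetadataSerde serde = new RemoteLogMetadataSerde();
        try (KafkaConsumer<byte[], byte[]> consumer =
                 new KafkaConsumer<>(props, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {
            // Partition 29 of __remote_log_metadata is the one carrying loadtest_topic-22 above.
            TopicPartition tp = new TopicPartition("__remote_log_metadata", 29);
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition: %d, offset: %d, value: %s%n",
                        record.partition(), record.offset(), serde.deserialize(record.value()));
                }
            }
        }
    }
}
{code}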
 

It seems that around the time the original leader (10013) is restarted, a second copy of the 
same segment is tiered by the new leader (10008). Interestingly, the second copy 
does not have the same end offset as the first (11544 vs 11536), which is concerning. 

Then the follower sees the following error repeatedly until the leader is 
restarted: 

{code:java}
[2024-05-17 20:46:42,133] DEBUG [ReplicaFetcher replicaId=10013, 
leaderId=10008, fetcherId=0] Handling errors in processFetchRequest for 
partitions HashSet(loadtest_topic-22) (kafka.server.ReplicaFetcherThread)
[2024-05-17 20:46:43,174] DEBUG [ReplicaFetcher replicaId=10013, 
leaderId=10008, fetcherId=0] Received error OFFSET_MOVED_TO_TIERED_STORAGE, at 
fetch offset: 11537, topic-partition: loadtest_topic-22 
(kafka.server.ReplicaFetcherThread)
[2024-05-17 20:46:43,175] ERROR [ReplicaFetcher replicaId=10013, 
leaderId=10008, fetcherId=0] Error building remote log auxiliary state for 
loadtest_topic-22 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.server.log.remote.storage.RemoteStorageException: Couldn't 
build the state from remote store for partition: loadtest_topic-22, 
currentLeaderEpoch: 153, leaderLocalLogStartOffset: 11545, 
leaderLogStartOffset: 11537, epoch: 142 as the previous remote log segment 
metadata was not found
{code}
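
To explain where that exception comes from: on OFFSET_MOVED_TO_TIERED_STORAGE the follower tries to rebuild its aux state (leader epoch cache and producer state) from the remote segment that contains leaderLocalLogStartOffset - 1 = 11544, using epoch 142. Below is a rough sketch of that lookup against the public RemoteLogMetadataManager interface; it is my paraphrase of what the fetcher's tier state machine does, not the actual code.

{code:java}
import java.util.Optional;

import org.apache.kafka.common.TopicIdPartition;
import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
import org.apache.kafka.server.log.remote.storage.RemoteStorageException;

public class AuxStateLookupSketch {
    // epochForOffset is 142 and leaderLocalLogStartOffset is 11545 in the failure above.
    static RemoteLogSegmentMetadata previousSegment(RemoteLogMetadataManager rlmm,
                                                    TopicIdPartition tpId,
                                                    int epochForOffset,
                                                    long leaderLocalLogStartOffset)
            throws RemoteStorageException {
        // The follower needs the remote segment containing leaderLocalLogStartOffset - 1 (11544)
        // to rebuild leader-epoch and producer state up to where the leader's local log begins.
        Optional<RemoteLogSegmentMetadata> metadata =
            rlmm.remoteLogSegmentMetadata(tpId, epochForOffset, leaderLocalLogStartOffset - 1);
        // Both tiered copies of this segment were already deleted (DELETE_SEGMENT_FINISHED above),
        // so the lookup comes back empty and the fetcher surfaces the
        // "Couldn't build the state from remote store" error on every retry.
        return metadata.orElseThrow(() -> new RemoteStorageException(
            "previous remote log segment metadata was not found"));
    }
}
{code}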
The follower is trying to fetch from 11537 and gets 
OFFSET_MOVED_TO_TIERED_STORAGE. Then, when the follower retries, it still 
thinks it needs to fetch from 11537. There is no data in S3, so the correct 
leaderLogStartOffset should be 11545. I'm not sure yet whether it's intentional 
that two differing copies of the same segment can be uploaded to 
S3, or whether the segments were simply deleted in the wrong order, but that is what 
ultimately caused the leaderLogStartOffset to be set incorrectly.
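
One way to confirm the "no data in S3" part from the metadata side is to list the remote segments still known for the partition and check whether any non-deleted one covers 11537; in this state nothing does, which is why 11545 looks like the right leaderLogStartOffset. A sketch against the RemoteLogMetadataManager interface follows; the rlmm instance is a placeholder for whatever plugin the cluster runs.

{code:java}
import java.util.Iterator;

import org.apache.kafka.common.TopicIdPartition;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.Uuid;
import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentState;
import org.apache.kafka.server.log.remote.storage.RemoteStorageException;

public class RemoteCoverageCheck {
    static boolean offsetStillTiered(RemoteLogMetadataManager rlmm, long offset)
            throws RemoteStorageException {
        // Topic id and partition taken from the __remote_log_metadata dump above.
        TopicIdPartition tpId = new TopicIdPartition(
            Uuid.fromString("ClnIeN0MQsi_d4FAOFKaDA"),
            new TopicPartition("loadtest_topic", 22));
        Iterator<RemoteLogSegmentMetadata> segments = rlmm.listRemoteLogSegments(tpId);
        while (segments.hasNext()) {
            RemoteLogSegmentMetadata segment = segments.next();
            boolean deleted = segment.state() == RemoteLogSegmentState.DELETE_SEGMENT_FINISHED;
            if (!deleted && segment.startOffset() <= offset && offset <= segment.endOffset()) {
                return true; // some remote segment still serves this offset
            }
        }
        // With GZeRTXLMSNe2BQjRXkg6hQ and L1TYzx0lQkagRIF86Kp0QQ both deleted, nothing
        // covers 11537 anymore, yet the leader still advertises it as leaderLogStartOffset.
        return false;
    }
}
{code}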


