[ https://issues.apache.org/jira/browse/KAFKA-16890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Satish Duggana resolved KAFKA-16890.
------------------------------------
    Resolution: Fixed

> Failing to build aux state on broker failover
> ---------------------------------------------
>
>                 Key: KAFKA-16890
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16890
>             Project: Kafka
>          Issue Type: Bug
>          Components: Tiered-Storage
>    Affects Versions: 3.7.0, 3.7.1
>            Reporter: Francois Visconte
>            Assignee: Kamal Chandraprakash
>            Priority: Major
>             Fix For: 3.8.0
>
>
> We have clusters where we replace machines often, and they fall into a state where we keep getting "Error building remote log auxiliary state for loadtest_topic-22" and the partition stays under-replicated until the leader is manually restarted.
> Looking into a specific case, here is what we observed in the __remote_log_metadata topic:
> {code:java}
> partition: 29, offset: 183593, value: RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, startOffset=10823, endOffset=11536, brokerId=10013, maxTimestampMs=1715774588597, eventTimestampMs=1715781657604, segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324, 133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535}, segmentSizeInBytes=704895, customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}
> partition: 29, offset: 183594, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715781658183, brokerId=10013}
> partition: 29, offset: 183669, value: RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, startOffset=10823, endOffset=11544, brokerId=10008, maxTimestampMs=1715781445270, eventTimestampMs=1715782717593, segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324, 133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535, 140=11537, 142=11543}, segmentSizeInBytes=713088, customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}
> partition: 29, offset: 183670, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715782718370, brokerId=10008}
> partition: 29, offset: 186215, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874617, brokerId=10008}
> partition: 29, offset: 186216, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874725, brokerId=10008}
> partition: 29, offset: 186217, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874729, brokerId=10008}
> partition: 29, offset: 186218, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874817, brokerId=10008}
> {code}
>
> It seems that around the time the leader (10013) was restarted, a second copy of the same segment was tiered by the new leader (10008). Interestingly, the two copies don't have the same end offset, which is concerning.
> Then the follower sees the following error repeatedly until the leader is restarted:
> {code:java}
> [2024-05-17 20:46:42,133] DEBUG [ReplicaFetcher replicaId=10013, leaderId=10008, fetcherId=0] Handling errors in processFetchRequest for partitions HashSet(loadtest_topic-22) (kafka.server.ReplicaFetcherThread)
> [2024-05-17 20:46:43,174] DEBUG [ReplicaFetcher replicaId=10013, leaderId=10008, fetcherId=0] Received error OFFSET_MOVED_TO_TIERED_STORAGE, at fetch offset: 11537, topic-partition: loadtest_topic-22 (kafka.server.ReplicaFetcherThread)
> [2024-05-17 20:46:43,175] ERROR [ReplicaFetcher replicaId=10013, leaderId=10008, fetcherId=0] Error building remote log auxiliary state for loadtest_topic-22 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.server.log.remote.storage.RemoteStorageException: Couldn't build the state from remote store for partition: loadtest_topic-22, currentLeaderEpoch: 153, leaderLocalLogStartOffset: 11545, leaderLogStartOffset: 11537, epoch: 142as the previous remote log segment metadata was not found
> {code}
> The follower is trying to fetch from 11537 and gets OFFSET_MOVED_TO_TIERED_STORAGE. When it retries, it still thinks it needs to fetch from 11537. There is no data in S3, so the correct leaderLogStartOffset should be 11545. I'm not sure yet whether it's intentional that two differing copies of the same segment can be uploaded to S3, or whether the segments were just deleted in the wrong order, but that is what ultimately caused the leaderLogStartOffset to be set incorrectly.
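>
> For completeness, here is roughly how records like the ones above can be dumped from the __remote_log_metadata topic. This is only a minimal sketch, not the exact tool we used; it assumes the storage module's RemoteLogMetadataSerde is on the classpath, and the bootstrap server and group id are placeholders (metadata partition 29 is hard-coded to match the listing above):
> {code:java}
> import java.time.Duration;
> import java.util.List;
> import java.util.Properties;
>
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.ConsumerRecord;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.TopicPartition;
> import org.apache.kafka.common.serialization.ByteArrayDeserializer;
> import org.apache.kafka.server.log.remote.metadata.storage.serialization.RemoteLogMetadataSerde;
> import org.apache.kafka.server.log.remote.storage.RemoteLogMetadata;
>
> public class DumpRemoteLogMetadata {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
>         props.put(ConsumerConfig.GROUP_ID_CONFIG, "remote-log-metadata-dump"); // placeholder
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
>         props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
>
>         // Serde used for __remote_log_metadata record values (assumed available from the storage module).
>         RemoteLogMetadataSerde serde = new RemoteLogMetadataSerde();
>         TopicPartition metadataPartition = new TopicPartition("__remote_log_metadata", 29);
>
>         try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>             consumer.assign(List.of(metadataPartition));
>             consumer.seekToBeginning(List.of(metadataPartition));
>             while (true) { // run until interrupted
>                 for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
>                     RemoteLogMetadata metadata = serde.deserialize(record.value());
>                     System.out.printf("partition: %d, offset: %d, value: %s%n",
>                             record.partition(), record.offset(), metadata);
>                 }
>             }
>         }
>     }
> }
> {code}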
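>
> To make the offset reasoning concrete, here is a small self-contained toy (not the broker code; the RemoteSegment record and helper are made up for illustration): once both remote copies have been deleted there is no live remote segment left to build aux state from, so the follower ought to resume from the leader's local log start offset (11545) rather than the stale leaderLogStartOffset (11537).
> {code:java}
> import java.util.List;
> import java.util.OptionalLong;
>
> public class AuxStateOffsetCheck {
>     // Simplified view of a remote segment: its offsets and whether it has been deleted.
>     record RemoteSegment(String id, long startOffset, long endOffset, boolean deleted) {}
>
>     // Where a follower should resume fetching after (re)building aux state.
>     static long expectedFollowerFetchOffset(List<RemoteSegment> segments, long leaderLocalLogStartOffset) {
>         // Highest end offset among segments that still exist in remote storage.
>         OptionalLong highestLiveRemoteOffset = segments.stream()
>                 .filter(s -> !s.deleted())
>                 .mapToLong(RemoteSegment::endOffset)
>                 .max();
>         // With live remote data, resume right after it; with none, only the
>         // leader's local log remains, so resume at its local log start offset.
>         return highestLiveRemoteOffset.isPresent()
>                 ? highestLiveRemoteOffset.getAsLong() + 1
>                 : leaderLocalLogStartOffset;
>     }
>
>     public static void main(String[] args) {
>         // The two copies of the segment observed above, both deleted in the end.
>         List<RemoteSegment> observed = List.of(
>                 new RemoteSegment("GZeRTXLMSNe2BQjRXkg6hQ", 10823, 11536, true),
>                 new RemoteSegment("L1TYzx0lQkagRIF86Kp0QQ", 10823, 11544, true));
>         long leaderLogStartOffset = 11537;      // stale value from the error above
>         long leaderLocalLogStartOffset = 11545; // from the same error line
>         System.out.printf("stale leaderLogStartOffset=%d, follower should resume at %d%n",
>                 leaderLogStartOffset, expectedFollowerFetchOffset(observed, leaderLocalLogStartOffset));
>         // Prints 11545, yet the follower keeps retrying 11537 and hits
>         // OFFSET_MOVED_TO_TIERED_STORAGE until the leader is restarted.
>     }
> }
> {code}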

--
This message was sent by Atlassian Jira
(v8.20.10#820010)