Francois Visconte created KAFKA-16890:
-----------------------------------------
             Summary: Failing to build aux state on broker failover
                 Key: KAFKA-16890
                 URL: https://issues.apache.org/jira/browse/KAFKA-16890
             Project: Kafka
          Issue Type: Bug
          Components: Tiered-Storage
    Affects Versions: 3.7.0, 3.7.1
            Reporter: Francois Visconte


We have clusters where we replace machines often, and they fall into a state where we keep getting "Error building remote log auxiliary state for loadtest_topic-22" and the partition stays under-replicated until the leader is manually restarted.

Looking into a specific case, here is what we observed in the __remote_log_metadata topic:

{code:java}
partition: 29, offset: 183593, value: RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, startOffset=10823, endOffset=11536, brokerId=10013, maxTimestampMs=1715774588597, eventTimestampMs=1715781657604, segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324, 133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535}, segmentSizeInBytes=704895, customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}

partition: 29, offset: 183594, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715781658183, brokerId=10013}

partition: 29, offset: 183669, value: RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, startOffset=10823, endOffset=11544, brokerId=10008, maxTimestampMs=1715781445270, eventTimestampMs=1715782717593, segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324, 133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535, 140=11537, 142=11543}, segmentSizeInBytes=713088, customMetadata=Optional.empty, state=COPY_SEGMENT_STARTED}

partition: 29, offset: 183670, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715782718370, brokerId=10008}

partition: 29, offset: 186215, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874617, brokerId=10008}

partition: 29, offset: 186216, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874725, brokerId=10008}

partition: 29, offset: 186217, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874729, brokerId=10008}

partition: 29, offset: 186218, value: RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22, id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty, state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874817, brokerId=10008}
{code}

It seems that at the time the original leader (10013) is restarted, a second copy of the same segment is tiered by the new leader (10008). Interestingly, the second copy does not have the same end offset, which is concerning.
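For illustration, the overlapping copies can be spotted directly through the public RemoteLogMetadataManager plugin interface. The snippet below is only a rough diagnostic sketch, not part of Kafka itself: it assumes an already-configured RemoteLogMetadataManager instance named rlmm, hard-codes the topic id and partition from the dump above, and flags remote segments whose offset ranges overlap.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.kafka.common.TopicIdPartition;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.Uuid;
import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;

public class RemoteSegmentOverlapCheck {

    // rlmm is assumed to be an already-initialized and configured
    // RemoteLogMetadataManager (e.g. the topic-based implementation).
    static void printOverlaps(RemoteLogMetadataManager rlmm) throws Exception {
        TopicIdPartition tp = new TopicIdPartition(
                Uuid.fromString("ClnIeN0MQsi_d4FAOFKaDA"),
                new TopicPartition("loadtest_topic", 22));

        // Collect all remote segment metadata known for the partition.
        List<RemoteLogSegmentMetadata> segments = new ArrayList<>();
        Iterator<RemoteLogSegmentMetadata> it = rlmm.listRemoteLogSegments(tp);
        while (it.hasNext()) {
            segments.add(it.next());
        }
        segments.sort(Comparator.comparingLong(RemoteLogSegmentMetadata::startOffset));

        // Flag segments whose offset ranges overlap, like the two copies above
        // (10823-11536 from broker 10013 and 10823-11544 from broker 10008).
        for (int i = 1; i < segments.size(); i++) {
            RemoteLogSegmentMetadata prev = segments.get(i - 1);
            RemoteLogSegmentMetadata cur = segments.get(i);
            if (cur.startOffset() <= prev.endOffset()) {
                System.out.printf("Overlap: %s [%d-%d, %s] and %s [%d-%d, %s]%n",
                        prev.remoteLogSegmentId().id(), prev.startOffset(), prev.endOffset(), prev.state(),
                        cur.remoteLogSegmentId().id(), cur.startOffset(), cur.endOffset(), cur.state());
            }
        }
    }
}
{code}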
Then the follower sees the following error repeatedly until the leader is restarted:

{code:java}
[2024-05-17 20:46:42,133] DEBUG [ReplicaFetcher replicaId=10013, leaderId=10008, fetcherId=0] Handling errors in processFetchRequest for partitions HashSet(loadtest_topic-22) (kafka.server.ReplicaFetcherThread)
[2024-05-17 20:46:43,174] DEBUG [ReplicaFetcher replicaId=10013, leaderId=10008, fetcherId=0] Received error OFFSET_MOVED_TO_TIERED_STORAGE, at fetch offset: 11537, topic-partition: loadtest_topic-22 (kafka.server.ReplicaFetcherThread)
[2024-05-17 20:46:43,175] ERROR [ReplicaFetcher replicaId=10013, leaderId=10008, fetcherId=0] Error building remote log auxiliary state for loadtest_topic-22 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.server.log.remote.storage.RemoteStorageException: Couldn't build the state from remote store for partition: loadtest_topic-22, currentLeaderEpoch: 153, leaderLocalLogStartOffset: 11545, leaderLogStartOffset: 11537, epoch: 142 as the previous remote log segment metadata was not found
{code}

The follower is trying to fetch from 11537 and gets OFFSET_MOVED_TO_TIERED_STORAGE. When it retries, it still thinks it needs to fetch from 11537. There is no longer any data in S3 for those offsets, so the correct leaderLogStartOffset should be 11545.

I'm not sure yet whether it is intentional that two different copies of the same segment can be uploaded to S3, or whether the segments were simply deleted in the wrong order, but that is what ultimately caused the leaderLogStartOffset to be set incorrectly.
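To make the failure mode concrete, here is a simplified sketch of the metadata lookup the follower effectively performs while rebuilding the auxiliary state. It is not the actual ReplicaFetcherThread / tier state machine code, only an approximation against the public RemoteLogMetadataManager interface, with the values hard-coded from the error above (leaderLocalLogStartOffset 11545, epoch 142 for the preceding offset 11544). Because both copies covering that offset were deleted, the lookup comes back empty and the fetcher fails, only to retry from the same fetch offset 11537.

{code:java}
import java.util.Optional;

import org.apache.kafka.common.TopicIdPartition;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.Uuid;
import org.apache.kafka.server.log.remote.storage.RemoteLogMetadataManager;
import org.apache.kafka.server.log.remote.storage.RemoteLogSegmentMetadata;
import org.apache.kafka.server.log.remote.storage.RemoteStorageException;

public class AuxStateLookupSketch {

    // Simplified approximation of the follower-side lookup; rlmm is assumed
    // to be an already-configured RemoteLogMetadataManager instance.
    static void buildAuxState(RemoteLogMetadataManager rlmm) throws RemoteStorageException {
        TopicIdPartition tp = new TopicIdPartition(
                Uuid.fromString("ClnIeN0MQsi_d4FAOFKaDA"),
                new TopicPartition("loadtest_topic", 22));

        long leaderLocalLogStartOffset = 11545L;          // from the error above
        long previousOffset = leaderLocalLogStartOffset - 1; // 11544
        int epochForPreviousOffset = 142;                  // from the error above

        Optional<RemoteLogSegmentMetadata> metadata =
                rlmm.remoteLogSegmentMetadata(tp, epochForPreviousOffset, previousOffset);

        if (!metadata.isPresent()) {
            // Both copies covering offsets up to 11544 were deleted from remote
            // storage, so nothing is found and the fetcher gives up, only to
            // retry from the same fetch offset (11537) on the next iteration.
            throw new RemoteStorageException(
                    "Couldn't build the state from remote store for partition: " + tp);
        }
    }
}
{code}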