[jira] [Commented] (KAFKA-12241) Partition offline when ISR shrinks to leader and LogDir goes offline

Tobias Gustafsson (Jira) Fri, 21 Oct 2022 00:41:09 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-12241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621891#comment-17621891
 ]


Tobias Gustafsson commented on KAFKA-12241:
-------------------------------------------

We've also been hit by this issue, on Kafka 2.8.1. The trigger is different 
(in our case, chaotic network conditions), but the end result is exactly the 
same. A fix would be much appreciated!

We use acks=all and min.insync.replicas >= 2, so the solution proposed in this 
ticket (and the one in KAFKA-3861) seems perfectly viable for getting the 
partitions online again without risking data loss.
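
For readers unfamiliar with the two settings mentioned above, here is a minimal 
sketch of how they are typically expressed. The class and method names are 
hypothetical, and the value 2 is only an example of ">= 2"; the keys "acks" and 
"min.insync.replicas" are the standard Kafka producer and topic/broker 
configuration names.

```java
import java.util.Properties;

// Hypothetical holder class, for illustration only.
public class DurabilitySettings {

    // Producer-side setting: a write is acknowledged only after every
    // replica currently in the ISR has it.
    static Properties producerProps() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        return p;
    }

    // Topic-level (or broker default) setting: with acks=all, writes are
    // rejected when fewer than this many replicas are in sync.
    // The value 2 is an assumed example of "min-isr >= 2".
    static Properties topicProps() {
        Properties p = new Properties();
        p.setProperty("min.insync.replicas", "2");
        return p;
    }

    public static void main(String[] args) {
        System.out.println("producer: " + producerProps());
        System.out.println("topic:    " + topicProps());
    }
}
```

With both in place, a record the producer saw acknowledged was on at least two 
replicas at that moment, which is why electing one of the followers should not 
lose acknowledged data.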


> Partition offline when ISR shrinks to leader and LogDir goes offline
> --------------------------------------------------------------------
>
>                 Key: KAFKA-12241
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12241
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.2
>            Reporter: Noa Resare
>            Priority: Major
>
> This is a long-standing issue that we haven't previously tracked in a JIRA. 
> We experience this maybe once per month on average and we see the following 
> sequence of events:
>  # A broker shrinks ISR to just itself for a partition. However, the 
> followers are at highWatermark:{{ [Partition PARTITION broker=601] Shrinking 
> ISR from 1501,601,1201,1801 to 601. Leader: (highWatermark: 432385279, 
> endOffset: 432385280). Out of sync replicas: (brokerId: 1501, endOffset: 
> 432385279) (brokerId: 1201, endOffset: 432385279) (brokerId: 1801, endOffset: 
> 432385279).}}
>  # Around this time (in the case I have in front of me, 20ms earlier 
> according to the logging subsystem) LogDirFailureChannel captures an Error 
> while appending records to PARTITION due to a readonly filesystem.
>  # ~20 ms after the ISR shrink, LogDirFailureHandler offlines the partition: 
> Logs for partitions LIST_OF_PARTITIONS are offline and logs for future 
> partitions are offline due to failure on log directory /kafka/d6/data 
>  # ~50ms later the controller marks the replicas as offline from 601: 
> message: [Controller id=901] Mark replicas LIST_OF_PARTITIONS on broker 601 
> as offline 
>  # ~2ms later the controller offlines the partition: [Controller id=901 
> epoch=4] Changed partition PARTITION state from OnlinePartition to 
> OfflinePartition 
>
> To resolve this, someone needs to manually enable unclean leader election, 
> which is obviously not ideal. Since the leader knows that all the followers 
> removed from the ISR are at highWatermark, maybe it could convey that to the 
> controller in the LeaderAndIsr response so that the controller could pick a 
> new leader without having to resort to unclean leader election.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
