[jira] [Created] (KAFKA-15688) Partition leader election not running when disk IO hangs

Peter Sinoros-Szabo (Jira) Thu, 26 Oct 2023 05:45:06 -0700

Peter Sinoros-Szabo created KAFKA-15688:
-------------------------------------------


             Summary: Partition leader election not running when disk IO hangs
                 Key: KAFKA-15688
                 URL: https://issues.apache.org/jira/browse/KAFKA-15688
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.3.2
            Reporter: Peter Sinoros-Szabo


We run our Kafka brokers on AWS EC2 nodes, using AWS EBS volumes as the disks 
that store the messages.

Recently we had an issue where the EBS disk IO stalled completely, so Kafka 
was not able to read from or write to the disk. The only exception was data 
that was still in the page cache, or writes that still fit into the page cache 
before being synced to EBS.

We have experienced this issue a few times. Sometimes the partition leaders 
were moved to other brokers automatically. In other cases that did not happen, 
and Producers failed to produce messages to that broker.

My expectation is that Kafka notices such a stall and moves the leaders to 
other brokers where the partition has in-sync replicas, but as I mentioned, 
this did not always happen.

I know Kafka shuts itself down when it can't write to its disk. That might be 
a good solution in this case as well, since it would trigger the leader 
election automatically.

Is it possible to add such a feature to Kafka so that it shuts down in this 
case as well?
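
As a rough illustration of the requested behaviour (this is not Kafka code), 
a stall check could be sketched in Python as below. The names fsync_probe and 
disk_is_responsive, the probe file prefix and the timeout are all hypothetical. 
The sketch assumes that a stalled fsync cannot be interrupted, so the probe 
runs in a daemon thread and the caller only waits for a bounded time.

```python
import os
import tempfile
import threading


def fsync_probe(directory):
    """Write a few bytes into `directory` and force them to the device."""
    fd, path = tempfile.mkstemp(dir=directory, prefix=".io-probe-")
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # blocks until the device acknowledges the write
    finally:
        os.close(fd)
        os.unlink(path)


def disk_is_responsive(directory, timeout_s, probe=fsync_probe):
    """Return True if `probe(directory)` finishes within `timeout_s` seconds.

    The probe runs in a daemon thread because a hung fsync may never return;
    the caller gives up after the timeout instead of blocking with it.
    """
    done = threading.Event()
    failed = []

    def run():
        try:
            probe(directory)
        except OSError as exc:
            failed.append(exc)  # an explicit IO error also counts as unhealthy
        finally:
            done.set()

    threading.Thread(target=run, daemon=True).start()
    # Event.wait returns False on timeout, True once the probe has finished.
    return done.wait(timeout_s) and not failed
```

A watchdog built on this idea would call disk_is_responsive periodically for 
each log directory and stop the broker process after one or more failed 
checks, so that leadership moves to the in-sync replicas on other brokers. 
The check interval and the number of tolerated failures are open design 
choices.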

I guess a similar issue might happen with other disk subsystems too, or even 
with a broken or slow disk.

This scenario can be easily reproduced using AWS FIS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
