Peter Sinoros-Szabo created KAFKA-15688:
-------------------------------------------

             Summary: Partition leader election not running when disk IO hangs
                 Key: KAFKA-15688
                 URL: https://issues.apache.org/jira/browse/KAFKA-15688
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 3.3.2
            Reporter: Peter Sinoros-Szabo


We run our Kafka brokers on AWS EC2 nodes, using AWS EBS volumes as the disks that store the messages. Recently we had an incident in which EBS disk IO stalled completely, so Kafka could not read from or write to the disk, except for data that was still in the page cache or that still fit into the page cache before being synced to EBS.

We have hit this issue a few times: in some cases the partition leaders were moved to other brokers automatically; in other cases that did not happen, and producers failed to produce messages to that broker.

My expectation is that Kafka notices such a stall and moves the leaders to other brokers that have in-sync replicas of the partitions, but as mentioned, this did not always happen.

I know that Kafka shuts itself down when it cannot write to its disk. That might be a good solution here as well, since it would trigger leader election automatically. Is it possible to add such a feature to Kafka, so that it also shuts down in this case?

I guess a similar issue could happen with other disk subsystems too, or even with a broken and slow disk.

This scenario can be easily reproduced using AWS FIS.
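
To make the request more concrete, here is a rough sketch of the kind of check I have in mind. It is not existing Kafka code: the class `DiskStallProbe`, the probe file name `.disk-stall-probe`, and the `Result` values are all hypothetical. The idea is to write a byte to the log directory and force it to the device on a separate thread, and to treat a missed deadline as a stalled disk, which is different from an IO error that returns promptly.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical watchdog sketch; not part of Kafka. */
public class DiskStallProbe {
    public enum Result { OK, IO_ERROR, STALLED, INTERRUPTED }

    private final Path probeFile;
    private final Duration timeout;

    public DiskStallProbe(Path logDir, Duration timeout) {
        // Assumed probe file name; any small scratch file in the log dir would do.
        this.probeFile = logDir.resolve(".disk-stall-probe");
        this.timeout = timeout;
    }

    /** Writes one byte and forces it to the device, so the page cache cannot hide a stall. */
    private Void writeAndSync() throws IOException {
        try (FileChannel ch = FileChannel.open(probeFile,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(new byte[] {1}));
            ch.force(true);
        }
        return null;
    }

    /** Runs a single probe against the configured directory. */
    public Result probeOnce() {
        return runWithTimeout(this::writeAndSync, timeout);
    }

    /** Runs the given IO action on a daemon thread and waits at most {@code timeout} for it. */
    public static Result runWithTimeout(Callable<Void> io, Duration timeout) {
        // Daemon thread: a thread stuck in a hung write must not keep the JVM alive.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-stall-probe");
            t.setDaemon(true);
            return t;
        });
        Future<Void> pending = exec.submit(io);
        try {
            pending.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return Result.OK;
        } catch (TimeoutException e) {
            // The IO neither succeeded nor failed in time: the hang case from this report.
            pending.cancel(true);
            return Result.STALLED;
        } catch (ExecutionException e) {
            // The IO failed promptly, e.g. with an IOException.
            return Result.IO_ERROR;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Result.INTERRUPTED;
        } finally {
            exec.shutdownNow();
        }
    }
}
```

What the broker does on `STALLED` is left out on purpose. Shutting down, as it already does when it cannot write to its disk, would be one option, since that triggers the leader election. Note that a thread blocked in a hung write may not react to the cancel, which is why the probe thread is a daemon.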