We monitor the p95 log flush latency on all our Kafka nodes, and occasionally we see it creep up on a node from the usual figure of under 15 ms to above 150 ms.
Restarting the node usually doesn't help. It seems to fix itself over time, but we are not sure of the underlying reason. The affected node's bytes-in/second and messages-in/second are in line with the other brokers in the cluster. When one of these incidents happens, it usually lasts for hours.

Thanks,
Rajiv