[ https://issues.apache.org/jira/browse/KAFKA-14445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haruki Okada updated KAFKA-14445:
---------------------------------
    Description: 

Produce requests may fail with a timeout governed by `request.timeout.ms` in two cases:
 * The producer didn't receive the produce response within `request.timeout.ms`
 * The produce response was received, but it carried the `REQUEST_TIMED_OUT` error from the broker

The former case usually happens when a broker machine fails or there is a network glitch. The connection is then disconnected and a metadata update is requested to discover the new leader: [https://github.com/apache/kafka/blob/3.3.1/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L556]

The problem is the latter case (REQUEST_TIMED_OUT returned by the broker). Here the produce request ends with a TimeoutException, which doesn't inherit from InvalidMetadataException and therefore doesn't trigger a metadata update (sketched below).
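For context, here is a rough paraphrase (not the verbatim source) of how the Sender handles a per-partition produce error in Sender#completeBatch; the helper names follow the 3.3.1 code, but the body is heavily simplified:

{code:java}
// Simplified paraphrase of the error handling in Sender#completeBatch (3.3.1).
if (error != Errors.NONE) {
    if (canRetry(batch, response, now)) {
        // The batch is re-enqueued, but it will be resent to the same
        // (possibly stale) leader recorded in the producer's metadata.
        reenqueueBatch(batch, now);
    } else {
        failBatch(batch, response); // simplified; the real call takes more arguments
    }
    // A metadata refresh is requested only for InvalidMetadataException subtypes
    // (e.g. NotLeaderOrFollowerException, UnknownTopicOrPartitionException).
    // Errors.REQUEST_TIMED_OUT maps to TimeoutException, which does not extend
    // InvalidMetadataException, so this branch is skipped and the stale leader is kept.
    if (error.exception() instanceof InvalidMetadataException)
        metadata.requestUpdate();
}
{code}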
The typical cause of REQUEST_TIMED_OUT is replication delay due to a follower-side problem, in which case a metadata update indeed wouldn't help.

However, we found that in some cases stale metadata on REQUEST_TIMED_OUT can cause produce requests to retry unnecessarily, which may end in batch expiration once the delivery timeout is reached. Below is the scenario we experienced:
 * Environment:
 ** Partition tp-0 has 3 replicas: 1, 2, 3. The leader is 1
 ** min.insync.replicas=2
 ** acks=all
 * Scenario:
 ** Broker 1 "partially" failed
 *** It lost its ZooKeeper connection and was kicked out of the cluster
 **** There was a controller log like:
{code:java}
[2022-12-04 08:01:04,013] INFO [Controller id=XX] Newly added brokers: , deleted brokers: 1, bounced brokers: {code}
 *** However, somehow the broker continued to receive produce requests
 **** We're still investigating how this was possible
 **** Indeed, broker 1 was somewhat "alive" and kept working according to server.log
 *** In other words, broker 1 became a "zombie"
 ** Broker 2 was elected as the new leader
 *** Broker 3 became a follower of broker 2
 *** However, since broker 1 was still out of the cluster, it never received a LeaderAndIsr request, so it kept considering itself the leader of tp-0
 ** Meanwhile, the producer kept sending produce requests to broker 1, and they failed with REQUEST_TIMED_OUT because no broker was replicating from broker 1
 *** REQUEST_TIMED_OUT doesn't trigger a metadata update, so the producer never had a chance to refresh its stale metadata

So I suggest requesting a metadata update even on a REQUEST_TIMED_OUT error, to cover the case where the old leader has become a "zombie". A minimal sketch of the change follows.
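Here is that sketch, assuming it sits next to the existing InvalidMetadataException check in Sender#completeBatch shown above; the exact placement and any additional guards are open to discussion:

{code:java}
// Hypothetical change: also refresh metadata when the broker returned
// REQUEST_TIMED_OUT, so that a producer stuck on a fenced "zombie" leader
// can discover the new leader instead of retrying the same broker until
// the batch expires at delivery.timeout.ms.
if (error.exception() instanceof InvalidMetadataException
        || error == Errors.REQUEST_TIMED_OUT) {
    metadata.requestUpdate();
}
{code}

Since a metadata refresh is cheap compared to waiting out delivery.timeout.ms, an occasional unnecessary refresh on a genuine replication delay seems like an acceptable trade-off.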
> Producer doesn't request metadata update on REQUEST_TIMED_OUT
> -------------------------------------------------------------
>
>                 Key: KAFKA-14445
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14445
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Haruki Okada
>            Priority: Major