Hector Geraldino created KAFKA-20628:
----------------------------------------

             Summary: TimeoutException cascades in Connect due to 
PartitionLeaderCache stale entries
                 Key: KAFKA-20628
                 URL: https://issues.apache.org/jira/browse/KAFKA-20628
             Project: Kafka
          Issue Type: Bug
          Components: clients, connect
    Affects Versions: 4.3.0, 4.2.0, 4.1.0, 4.0.0
            Reporter: Hector Geraldino


The {{PartitionLeaderCache}} introduced in 
[KAFKA-17663|https://github.com/apache/kafka/pull/17367] has no expiration 
mechanism. When the admin client is used infrequently (minutes to hours between 
calls), cached partition-leader mappings become stale.

If the broker cached as a partition's leader goes down between calls, the next 
{{listOffsets()}} skips the metadata lookup and sends the request straight to 
the dead broker, waiting {{request.timeout.ms}} (default 30s) before retrying.

h3. Impact on Kafka Connect

After upgrading our Kafka Connect fleet from 3.9 to 4.2, we started seeing 
many {{TimeoutException}}s each time our Kafka brokers restarted.

In Connect's distributed mode, the admin client used by {{KafkaBasedLog}} for 
internal topics is called infrequently — mostly for session key rotation every 
hour. When a broker hosting the config topic partition leader is bounced 
between rotations:

  1. {{putSessionKey()}} writes the key (producer refreshes metadata, 
succeeds), then calls {{configLog.readToEnd()}} to confirm.
  2. The background thread's {{admin.endOffsets()}} hits the (now stale) cache, 
sending the request to the dead broker.
  3. The admin's retry cycle ({{default.api.timeout.ms}}, default 60s) exceeds 
{{putSessionKey}}'s hardcoded 30-second budget 
({{READ_WRITE_TOTAL_TIMEOUT_MS}}), causing a {{TimeoutException}}.
  4. The herder enters a {{readConfigToEnd}} retry loop. Each attempt times out 
after {{worker.sync.timeout.ms}} (default 3s), and the worker leaves the group 
— triggering cascading rebalances across the cluster.

h2. Proposed Fix

Add TTL-based expiration to {{PartitionLeaderCache}} using the existing 
{{metadata.max.age.ms}} (default 5 min) as the expiry. As the cached leader 
info is derived from metadata, it makes sense for it not to outlive the 
metadata refresh interval. This preserves the caching optimization for rapid 
successive calls while preventing stale routing for infrequent calls. 
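
A minimal sketch of the idea, for illustration only. The class {{TtlLeaderCache}}, its string keys, and the injected clock are hypothetical and do not reflect the actual {{PartitionLeaderCache}} internals; the TTL would be set to the value of {{metadata.max.age.ms}}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;
import java.util.function.LongSupplier;

/** Hypothetical TTL-bounded leader cache; NOT the real PartitionLeaderCache. */
final class TtlLeaderCache {

    /** A cached leader id together with the time it was stored. */
    private static final class Entry {
        final int leaderId;
        final long cachedAtMs;

        Entry(int leaderId, long cachedAtMs) {
            this.leaderId = leaderId;
            this.cachedAtMs = cachedAtMs;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlMs;           // assumed to come from metadata.max.age.ms
    private final LongSupplier clockMs; // injectable clock so expiry can be tested

    TtlLeaderCache(long ttlMs, LongSupplier clockMs) {
        this.ttlMs = ttlMs;
        this.clockMs = clockMs;
    }

    synchronized void put(String topicPartition, int leaderId) {
        entries.put(topicPartition, new Entry(leaderId, clockMs.getAsLong()));
    }

    /** Returns the cached leader, or empty if it is absent or older than the TTL. */
    synchronized OptionalInt get(String topicPartition) {
        Entry entry = entries.get(topicPartition);
        if (entry == null) {
            return OptionalInt.empty();
        }
        if (clockMs.getAsLong() - entry.cachedAtMs >= ttlMs) {
            // Expired: drop the entry so the caller falls back to a metadata lookup.
            entries.remove(topicPartition);
            return OptionalInt.empty();
        }
        return OptionalInt.of(entry.leaderId);
    }
}
```

With a 5-minute TTL, back-to-back calls still hit the cache, while a call made an hour later gets an empty result and has to look up the current leader first.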

h3. Our workaround

To avoid this problem, we have added the following config knobs to our workers:

  {code}
  request.timeout.ms=10000
  default.api.timeout.ms=20000
  worker.sync.timeout.ms=15000
  {code}

This ensures the admin retry time fits within {{putSessionKey}}'s 30-second 
budget.
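
The arithmetic behind these values can be sketched as follows. This is a simplified model that ignores retry backoff and the time needed to refresh metadata; the class and method names are hypothetical, and only the 30-second {{READ_WRITE_TOTAL_TIMEOUT_MS}} value is taken from the description above.

```java
/** Simplified timing model; ignores retry backoff and metadata refresh latency. */
final class TimeoutBudget {

    // putSessionKey's hardcoded budget, as stated in this report.
    static final long READ_WRITE_TOTAL_TIMEOUT_MS = 30_000L;

    /** Time left in the budget after one request to a dead broker has timed out. */
    static long remainingAfterFirstTimeoutMs(long requestTimeoutMs) {
        return READ_WRITE_TOTAL_TIMEOUT_MS - requestTimeoutMs;
    }

    /** True if the admin call's overall deadline ends before the caller's budget does. */
    static boolean adminDeadlineFitsBudget(long defaultApiTimeoutMs) {
        return defaultApiTimeoutMs < READ_WRITE_TOTAL_TIMEOUT_MS;
    }
}
```

Under this model, the defaults ({{request.timeout.ms=30000}}, {{default.api.timeout.ms=60000}}) leave no time for a retry once the first request times out, whereas the workaround values leave about 20 seconds and cap the whole admin call at 20 seconds.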

h2. How to Reproduce

  1. Start a distributed Connect cluster (2+ workers).
  2. Stop the broker hosting the {{_connect-configs}} topic partition leader.
  3. Wait for the next session key rotation (or set 
{{inter.worker.key.ttl.ms=60000}}).
  4. Observe {{TimeoutException}}, followed by cascading group leaves.



