Kay Hartmann created KAFKA-17831:
------------------------------------
Summary: Transaction coordinators returning
COORDINATOR_LOAD_IN_PROGRESS until leader changes or brokers are restarted
after network instability
Key: KAFKA-17831
URL: https://issues.apache.org/jira/browse/KAFKA-17831
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 3.7.1, 3.6.1
Reporter: Kay Hartmann
After experiencing a (heavy) network outage/instability, our brokers ended up in
a state where some producers could no longer perform transactions: the brokers
kept responding to those producers with `COORDINATOR_LOAD_IN_PROGRESS`. We saw
the corresponding DEBUG logs on the brokers:
{code:java}
2024-08-06 15:22:01,178 DEBUG [TransactionCoordinator id=11] Returning
COORDINATOR_LOAD_IN_PROGRESS error code to client for my-client's AddPartitions
request (kafka.coordinator.transaction.TransactionCoordinator)
[data-plane-kafka-request-handler-5] {code}
This did not occur for all transactions, but only for a subset of transactional
ids that hash to the same `__transaction_state` partition and therefore go
through the same transaction coordinator (the leader of that partition). We were
able to resolve this the first time by shifting the partition leaders of the
transaction topic around, and the second time by simply restarting the brokers.
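For context, the mapping from transactional id to `__transaction_state` partition is a hash modulo the partition count, so all ids with the same hash-mod-count share one partition and hence one coordinator. A sketch of that mapping (our simplified reading of `TransactionStateManager.partitionFor`; the partition count and class name here are illustrative):

{code:java}
public class TxnPartitioner {
    // Sketch, not Kafka's actual code: hash the transactional id, mask the
    // sign bit to keep it non-negative (analogous to Kafka's Utils.abs), and
    // take it modulo the __transaction_state partition count.
    static int partitionFor(String transactionalId, int numPartitions) {
        return (transactionalId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // transaction.state.log.num.partitions defaults to 50.
        System.out.println(partitionFor("my-client", 50));
    }
}
{code}

If that one partition's coordinator is stuck "loading", every transactional id mapping to it is affected, matching the subset we observed.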
This led us to believe the cause is some kind of stale in-memory state that
transaction coordinators hold for a `__transaction_state` partition. We found
two cases
([#1|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L319],
[#2|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L376])
in which the TransactionStateManager returns `COORDINATOR_LOAD_IN_PROGRESS`.
In both cases, `loadingPartitions` contains an entry signaling that the
TransactionStateManager is still busy initializing transactional data for that
`__transaction_state` partition.
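The effect can be illustrated with a simplified model (an illustration only, not the real TransactionStateManager): while a partition has an entry in `loadingPartitions`, every request routed to it is answered with `COORDINATOR_LOAD_IN_PROGRESS`, so an entry that is never removed makes the error permanent for that partition.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Simplified model of the gating behavior. If finishLoading is never called
// for a partition (e.g. because a state transition was missed during leader
// churn), errorFor keeps returning COORDINATOR_LOAD_IN_PROGRESS until the
// entry is cleared some other way, such as a leader change or broker restart.
class LoadingGate {
    private final Set<Integer> loadingPartitions = new HashSet<>();

    void startLoading(int partition)  { loadingPartitions.add(partition); }
    void finishLoading(int partition) { loadingPartitions.remove(partition); }

    String errorFor(int partition) {
        return loadingPartitions.contains(partition)
                ? "COORDINATOR_LOAD_IN_PROGRESS"
                : "NONE";
    }
}
{code}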
We believe the network outage caused partition leaders to shift continuously
between their replicas, and that this somehow left outdated entries in
`loadingPartitions` that were never cleaned up. I had a look at the
[method|https://github.com/apache/kafka/blob/3.6.1/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L518]
where the set is updated and cleaned, but wasn't able to identify a case in
which the cleanup could fail.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)