José Armando García Sancio created KAFKA-14703:
--------------------------------------------------

             Summary: Don't resign when failing to replay uncommitted records
                 Key: KAFKA-14703
                 URL: https://issues.apache.org/jira/browse/KAFKA-14703
             Project: Kafka
          Issue Type: Improvement
          Components: controller
            Reporter: José Armando García Sancio


h1. Problem

The KRaft controller replays both committed and uncommitted records. 
Committed records are replayed by the inactive controller. Uncommitted records 
are replayed by the active controller.

When handling an RPC, the active controller generates a response and a list of 
uncommitted records. The active controller replays the uncommitted records 
before sending them to the KRaft layer for durability and replication. If the 
active controller encounters an error when replaying the uncommitted records, 
it calls the process exit fault handler.

Indirectly, the process exit fault handler causes the controller to resign its 
KRaft leadership and close all of its client connections.

Most clients retry the RPC when they disconnect from the remote endpoint. If 
the RPC's replay error is deterministic then it is possible for the failure to 
propagate to all of the controllers as they become leaders. This handling may 
cause the controllers to become unavailable.

h1. Solution

We can prevent this failure from propagating to all of the controllers by 
changing how we handle errors when replaying uncommitted records. The active 
controller doesn't need to exit fatally if it fails to replay an uncommitted 
record. The active controller should instead fail the RPC with an 
UNKNOWN_ERROR and revert the in-memory state to the in-memory snapshot taken 
before the RPC was handled.
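
The proposed handling could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the controller's actual code: 
ReplaySketch, ActiveController, MetadataRecord, Result, handle and replay are 
all hypothetical names, a plain map stands in for the controller's in-memory 
state, and copying that map stands in for whatever in-memory snapshot 
mechanism the controller uses.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are Kafka's real classes.
final class ReplaySketch {
    // Outcome of handling one RPC.
    enum Result { NONE, UNKNOWN_ERROR }

    // Toy record meaning "set key to value". A null value models a record
    // whose replay fails deterministically.
    static final class MetadataRecord {
        final String key;
        final String value;

        MetadataRecord(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    static final class ActiveController {
        // Stand-in for the controller's in-memory metadata state.
        private Map<String, String> state = new HashMap<>();

        // Replays the uncommitted records generated for a single RPC.
        Result handle(List<MetadataRecord> uncommitted) {
            // Capture the in-memory state as it was before this RPC.
            Map<String, String> snapshot = new HashMap<>(state);
            try {
                for (MetadataRecord record : uncommitted) {
                    replay(record);
                }
                // In the real flow the records would now be sent to the
                // KRaft layer for durability and replication.
                return Result.NONE;
            } catch (RuntimeException e) {
                // Instead of exiting the process, revert to the snapshot
                // and fail only this RPC.
                state = snapshot;
                return Result.UNKNOWN_ERROR;
            }
        }

        private void replay(MetadataRecord record) {
            if (record.value == null) {
                throw new IllegalStateException("cannot replay " + record.key);
            }
            state.put(record.key, record.value);
        }

        // Read-only copy of the current state, for inspection.
        Map<String, String> state() {
            return Map.copyOf(state);
        }
    }
}
```

In this sketch a batch that fails partway through leaves no trace in the 
in-memory state, and the controller keeps its leadership, so a client retry 
fails again with UNKNOWN_ERROR rather than taking down the next leader.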



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
