[jira] [Comment Edited] (KAFKA-9837) New RPC for notifying controller of failed replica

Igor Soarez (Jira) Sat, 07 May 2022 07:39:41 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533287#comment-17533287
 ]


Igor Soarez edited comment on KAFKA-9837 at 5/7/22 2:38 PM:
------------------------------------------------------------

I'm trying to figure out what's the best way forward here.

It is nice that the controller does not decide which log directory gets to host 
each replica. This can allow each broker to make better allocation decisions, 
e.g. based on disk usage information. This benefit was also discussed as part 
of KIP-589 so I think we should try to keep that property if possible.

It would be useful to clarify our concerns here a bit further.

In ZooKeeper mode the broker notifies the controller via a z-node update, which 
then triggers a full LeaderAndIsr request back to the broker, and relies on per 
partition errors to identify the failed replicas. The approach in [~dengziming] 
's PR improves this with a single request that includes all the failed 
replicas. So this is already an improvement, as we go from:

{{Broker updates log dir failure z-node without specific replica information -> 
ZK updates the z-node -> z-node watch triggered in Controller -> Controller 
fetch notification -> ZK replies with notifications -> Controller sends 
LeaderAndIsr -> Broker answers -> Controller identifies failed replicas}}

To:

{{Broker sends full list of failed replicas -> Controller identifies failed 
replicas}}

However, if there are a lot of replicas in the same log directory that can mean 
both that:
 # The RPC from the broker to the controller indicating failed replicas in the 
log directory can be a very large request including O ( n ) replicas.
 # The metadata records generated in the KRaft controller are O ( n ).

Which of these are we aiming to tackle [~cmccabe] ? I may be missing some other 
concern here, please let me know.

I also have a couple of questions regarding each of these concerns:

Regarding 1., if we use Topic IDs instead of topic names the request it is 
likely to be smaller and have a more predictable size. e.g. offlining one 
million partitions should be about a 20 MB request (1M * (128 bits topic ID + 
32 bits partition ID)), is this unreasonable?

Regarding 2., currently, when a broker is fenced, metadata records are 
generated for every changed ISR set. That'll include all log directories, so 
this same concern should apply also for fenced brokers. Do we want to update 
this mechanism as well to have a single record identify all failed partitions?

One way to prevent O ( n ) requests and updates is to replace the 
ALTER_REPLICA_STATE RPC proposed in KIP-589 with a different set of RPCs:
 * ASSIGN_REPLICAS_TO_FAIL_GROUP - Each broker sends this RPC to the controller 
after assigning one or more
  replicas to any log directory, before activating the replica. It associates 
each replica to its assigned log directory. It introduces some extra delay in 
assigning new replicas to brokers but to the extra roundtrip and batching but 
that might be okay.
 * FAIL_GROUP - When a log directory fails, a single small request with the
  broker ID, and failure group indicates that all previously associated 
replicas are now offline.

The metadata log would include equivalent records to these RPCs. A metadata 
delta would be able to identify all the fail replicas from the group.

We can also use the same concept - the conceptual grouping replicas into 
"failure groups/domains" to minimize the metadata updates when a broker is 
fenced.

WDYT? If this sounds like it may be the general right direction I can work on a 
KIP.

 

 


was (Author: soarez):
I'm trying to figure out what's the best way forward here.

It is nice that the controller does not decide which log directory gets to host 
each replica. This can allow each broker to make better allocation decisions, 
e.g. based on disk usage information. This benefit was also discussed as part 
of KIP-589 so I think we should try to keep that property if possible.

It would be useful to clarify our concerns here a bit further.

In ZooKeeper mode the broker notifies the controller via a z-node update, which 
then triggers a full LeaderAndIsr request back to the broker, and relies on per 
partition errors to identify the failed replicas. The approach in [~dengziming] 
's PR improves this with a single request that includes all the failed 
replicas. So this is already an improvement, as we go from:

{{Broker updates log dir failure z-node without specific replica information -> 
ZK updates the z-node -> z-node watch triggered in Controller -> Controller 
fetch notification -> ZK replies with notifications -> Controller sends 
LeaderAndIsr -> Broker answers -> Controller identifies failed replicas}}

To:

{{Broker sends full list of failed replicas -> Controller identifies failed 
replicas}}

However, if there are a lot of replicas in the same log directory that can mean 
both that:
 # The RPC from the broker to the controller indicating failed replicas in the 
log directory can be a very large request including O ( n ) replicas.
 # The metadata records generated in the KRaft controller are O ( n ).

Which of these are we aiming to tackle [~cmccabe] ? I may be missing some other 
concern here, please let me know.

I also have a couple of questions regarding each of these concerns:

Regarding 1., if we use Topic IDs instead of topic names the request it is 
likely to be smaller and have a more predictable size. e.g. offlining one 
million partitions should be about a 20 MB request (1M * (128 bits topic ID + 
32 bits partition ID)), is this unreasonable?

Regarding 2., currently, when a broker is fenced, metadata records are 
generated for every changed ISR set. That'll include all log directories, so 
this same concern should apply also for fenced brokers. Do we want to update 
this mechanism as well to have a single record identify all failed partitions?

One way to prevent O(n) requests and updates is to replace the 
ALTER_REPLICA_STATE RPC proposed in KIP-589 with a different set of RPCs:
 * ASSIGN_REPLICAS_TO_FAIL_GROUP - Each broker sends this RPC to the controller 
after assigning one or more
  replicas to any log directory, before activating the replica. It associates 
each replica to its assigned log directory. It introduces some extra delay in 
assigning new replicas to brokers but to the extra roundtrip and batching but 
that might be okay.
 * FAIL_GROUP - When a log directory fails, a single small request with the
  broker ID, and failure group indicates that all previously associated 
replicas are now offline.

The metadata log would include equivalent records to these RPCs. A metadata 
delta would be able to identify all the fail replicas from the group.

We can also use the same concept - the conceptual grouping replicas into 
"failure groups/domains" to minimize the metadata updates when a broker is 
fenced.

WDYT? If this sounds like it may be the general right direction I can work on a 
KIP.

 

 

> New RPC for notifying controller of failed replica
> --------------------------------------------------
>
>                 Key: KAFKA-9837
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9837
>             Project: Kafka
>          Issue Type: New Feature
>          Components: controller, core
>            Reporter: David Arthur
>            Assignee: dengziming
>            Priority: Major
>              Labels: kip-500
>
> This is the tracking ticket for 
> [KIP-589|https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller].
>  For the bridge release, brokers should no longer use ZooKeeper to notify the 
> controller that a log dir has failed. It should instead use an RPC mechanism.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Comment Edited] (KAFKA-9837) New RPC for notifying controller of failed replica

Reply via email to