[
https://issues.apache.org/jira/browse/KAFKA-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533287#comment-17533287
]
Igor Soarez edited comment on KAFKA-9837 at 5/7/22 2:38 PM:
------------------------------------------------------------
I'm trying to figure out what's the best way forward here.
It is nice that the controller does not decide which log directory gets to host
each replica. This can allow each broker to make better allocation decisions,
e.g. based on disk usage information. This benefit was also discussed as part
of KIP-589 so I think we should try to keep that property if possible.
It would be useful to clarify our concerns here a bit further.
In ZooKeeper mode, the broker notifies the controller via a z-node update, which
then triggers a full LeaderAndIsr request back to the broker, and the controller
relies on per-partition errors in the response to identify the failed replicas.
The approach in [~dengziming]'s PR improves on this with a single request that
includes all the failed replicas. So this is already an improvement, as we go from:
{{Broker updates log dir failure z-node without specific replica information ->
ZK updates the z-node -> z-node watch triggered in Controller -> Controller
fetches notifications -> ZK replies with notifications -> Controller sends
LeaderAndIsr -> Broker answers -> Controller identifies failed replicas}}
To:
{{Broker sends full list of failed replicas -> Controller identifies failed
replicas}}
However, if there are a lot of replicas in the same log directory, that can mean
both that:
# The RPC from the broker to the controller indicating failed replicas in the
log directory can be a very large request including O(n) replicas.
# The metadata records generated in the KRaft controller are O(n).
Which of these are we aiming to tackle, [~cmccabe]? I may be missing some other
concern here; please let me know.
I also have a couple of questions regarding each of these concerns:
Regarding 1., if we use topic IDs instead of topic names, the request is
likely to be smaller and have a more predictable size. E.g. offlining one
million partitions should be about a 20 MB request (1M * (128 bits topic ID +
32 bits partition ID) = 1M * 20 bytes). Is this unreasonable?
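To make the size estimate above easy to re-check, here is a small sketch of the arithmetic. It only counts the per-replica payload (topic ID plus partition ID) and ignores request headers and any per-topic grouping or encoding overhead, which is an assumption on my part; the function and constant names are made up for illustration.

```python
# Back-of-the-envelope payload size for a request listing failed replicas.
# Assumption: each replica costs exactly one topic ID plus one partition ID,
# with no framing, headers, or per-topic grouping taken into account.
TOPIC_ID_BITS = 128      # topic ID is a 128-bit UUID
PARTITION_ID_BITS = 32   # partition ID is a 32-bit integer


def request_payload_bytes(num_partitions: int) -> int:
    """Return the estimated payload size in bytes for num_partitions replicas."""
    bits = num_partitions * (TOPIC_ID_BITS + PARTITION_ID_BITS)
    return bits // 8


if __name__ == "__main__":
    size = request_payload_bytes(1_000_000)
    print(f"{size} bytes (~{size / 1_000_000:.0f} MB)")
```

The payload grows linearly with the number of replicas in the failed log directory, which is the O(n) concern in point 1.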
Regarding 2., currently, when a broker is fenced, metadata records are
generated for every changed ISR set. That'll include all log directories, so
this same concern should also apply to fenced brokers. Do we want to update
this mechanism as well, so that a single record identifies all failed partitions?
One way to prevent O(n) requests and updates is to replace the
ALTER_REPLICA_STATE RPC proposed in KIP-589 with a different set of RPCs:
* ASSIGN_REPLICAS_TO_FAIL_GROUP - Each broker sends this RPC to the controller
after assigning one or more replicas to any log directory, before activating
the replica. It associates each replica with its assigned log directory. It
introduces some extra delay in assigning new replicas to brokers, due to the
extra roundtrip and batching, but that might be okay.
* FAIL_GROUP - When a log directory fails, a single small request with the
broker ID and failure group indicates that all previously associated
replicas are now offline.
The metadata log would include records equivalent to these RPCs. A metadata
delta would be able to identify all the failed replicas from the group.
We can also use the same concept - the conceptual grouping of replicas into
"failure groups/domains" - to minimize the metadata updates when a broker is
fenced.
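To illustrate the idea, here is a rough sketch of the bookkeeping the controller would need for the two RPCs described above. Everything in it is hypothetical: the class, method, and field names are invented for the example, it is an in-memory model rather than anything resembling the actual controller code, and it assumes a failure group is identified by a (broker ID, group name) pair.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Replica(NamedTuple):
    """Hypothetical replica identifier: topic ID plus partition number."""
    topic_id: str
    partition: int


class ControllerState:
    """Toy model of controller-side failure-group bookkeeping (illustrative only)."""

    def __init__(self) -> None:
        # (broker_id, fail_group) -> replicas the broker registered in that group
        self._groups = defaultdict(set)
        # (broker_id, fail_group) pairs that have been reported as failed
        self._failed = set()

    def assign_replicas_to_fail_group(
        self, broker_id: int, fail_group: str, replicas: Iterable[Replica]
    ) -> None:
        """Models ASSIGN_REPLICAS_TO_FAIL_GROUP: an O(batch) call at assignment time."""
        self._groups[(broker_id, fail_group)].update(replicas)

    def fail_group(self, broker_id: int, fail_group: str) -> None:
        """Models FAIL_GROUP: a constant-size call, however many replicas it covers."""
        self._failed.add((broker_id, fail_group))

    def offline_replicas(self, broker_id: int) -> set:
        """Derive the offline replicas on a broker from its failed groups."""
        offline = set()
        for key in self._failed:
            if key[0] == broker_id:
                offline |= self._groups.get(key, set())
        return offline


if __name__ == "__main__":
    state = ControllerState()
    state.assign_replicas_to_fail_group(
        1, "dir-a", [Replica("t1", 0), Replica("t1", 1)]
    )
    state.assign_replicas_to_fail_group(1, "dir-b", [Replica("t2", 0)])
    state.fail_group(1, "dir-a")
    print(sorted(state.offline_replicas(1)))
```

The O(n) cost does not disappear; it is paid incrementally when replicas are assigned, so that the failure notification itself (and its metadata record) stays small.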
WDYT? If this sounds like it may be generally the right direction, I can work on a
KIP.
> New RPC for notifying controller of failed replica
> --------------------------------------------------
>
> Key: KAFKA-9837
> URL: https://issues.apache.org/jira/browse/KAFKA-9837
> Project: Kafka
> Issue Type: New Feature
> Components: controller, core
> Reporter: David Arthur
> Assignee: dengziming
> Priority: Major
> Labels: kip-500
>
> This is the tracking ticket for
> [KIP-589|https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller].
> For the bridge release, brokers should no longer use ZooKeeper to notify the
> controller that a log dir has failed. It should instead use an RPC mechanism.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)