Hi, Colin,

Thanks for the reply. A few more comments.

55. There is still text that favors new broker registration. "When a broker
first starts up, when it is in the INITIAL state, it will always "win"
broker ID conflicts.  However, once it is granted a lease, it transitions
out of the INITIAL state.  Thereafter, it may lose subsequent conflicts if
its broker epoch is stale.  (See KIP-380 for some background on broker
epoch.)  The reason for favoring new processes is to accommodate the common
case where a process is killed with kill -9 and then restarted.  We want it
to be able to reclaim its old ID quickly in this case."
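Just to make the current behavior concrete, the quoted rule amounts to
something like the following (the names here are mine, not from the KIP):

    // Rough paraphrase of the quoted conflict rule; purely illustrative.
    static boolean incomingRegistrationWins(boolean incomingIsInitial,
                                            long existingEpoch,
                                            long incomingEpoch) {
        if (incomingIsInitial) {
            return true;                       // a freshly started broker always wins
        }
        return incomingEpoch >= existingEpoch; // afterwards, stale epochs lose
    }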

80.1 Sounds good. Could you document that listeners is a required config
now? It would also be useful to annotate other required configs. For
example, controller.connect should be required.
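For example, it would be good to state explicitly that a broker cannot
start without something like the following (hypothetical values, and
assuming controller.connect keeps that name):

    # hypothetical server.properties fragment; hosts and ports are placeholders
    listeners=PLAINTEXT://broker1.example.com:9092
    controller.connect=controller1:9093,controller2:9093,controller3:9093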

80.2 Could you list all deprecated existing configs? Another one is
control.plane.listener.name since the controller no longer sends
LeaderAndIsr, UpdateMetadata and StopReplica requests.

83.1 It seems that the broker can transition from FENCED to RUNNING without
registering for a new broker epoch. I am not sure how this works. Once the
controller fences a broker, there is no need for the controller to keep the
broker epoch around. So the fenced broker's heartbeat request with the
existing broker epoch will be rejected, leading the broker back to the
FENCED state again.
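To illustrate the circularity I have in mind in 83.1 (all names below are
made up; this is not meant to be the actual controller code):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the controller-side check described above.  If the epoch is
    // dropped at fencing time, a fenced broker re-sending its old epoch can
    // never get a successful heartbeat, so it stays FENCED until it registers
    // again and obtains a new epoch.
    class HeartbeatSketch {
        private final Map<Integer, Long> activeBrokerEpochs = new HashMap<>();

        void handleHeartbeat(int brokerId, long brokerEpoch) {
            Long current = activeBrokerEpochs.get(brokerId);
            if (current == null || current != brokerEpoch) {
                throw new IllegalStateException("stale or unknown broker epoch");
            }
            // otherwise: renew the lease and unfence the broker
        }
    }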

83.5 Good point on KIP-590. Then should we still expose the controller for
debugging purposes? If not, should we deprecate the controllerId field in
MetadataResponse?

90. We rejected the shared ID approach with just one reason: "This is not a
good idea because NetworkClient assumes a single ID space.  So if there is
both a controller 1 and a broker 1, we don't have a way of picking the
"right" one." This doesn't seem to be a strong reason. For example, we
could address the NetworkClient issue with a node type, as you pointed out,
or by using the negative value of a broker ID as the controller ID.
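For example, something along these lines would keep NetworkClient's single
int ID space while still telling the two apart (just a sketch of the idea,
not a concrete proposal):

    // Map controller k to a negative node ID so it never collides with broker
    // k.  Broker IDs are non-negative, and -1 is already used as a "no node" /
    // "no controller" sentinel, so the mapping starts at -2.
    static int controllerNodeId(int controllerId) {
        return -(controllerId + 2);   // controller 0 -> -2, controller 1 -> -3, ...
    }

    static boolean isControllerNode(int nodeId) {
        return nodeId <= -2;
    }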

100. In KIP-589
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller>,
the broker reports all offline replicas due to a disk failure to the
controller. It seems this information needs to be persisted to the metadata
log. Do we have a corresponding record for that?
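If there isn't one yet, presumably we need something roughly like the
following (a purely hypothetical shape, just to make the question concrete):

    // Hypothetical record appended to the metadata log when a broker reports
    // replicas that went offline because of a log dir failure; not from the KIP.
    class OfflineReplicaRecord {
        int brokerId;        // the broker reporting the failure
        String topicName;    // the affected topic
        int partitionId;     // the affected partition
    }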

101. Currently, the StopReplica request has two modes: without deletion and
with deletion. The former is used for controlled shutdown and handling disk
failure, and causes the follower to stop. The latter is for topic deletion
and partition reassignment, and causes the replica to be deleted. Since we
are deprecating StopReplica, could we document what triggers the stopping
of a follower and the deleting of a replica now?
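For what it's worth, the deletion half seems derivable from the metadata
log along the lines of the sketch below (the names are placeholders); the
stop-without-deleting half is the part I'd especially like to see spelled
out.

    // Illustrative only: when replaying the metadata log, a broker would
    // delete a local replica if the topic was removed, or if it is no longer
    // in the partition's replica set after a reassignment -- roughly the
    // cases StopReplica with deletion used to cover.
    static boolean shouldDeleteLocalReplica(int localBrokerId,
                                            java.util.List<Integer> newReplicas,
                                            boolean topicDeleted) {
        return topicDeleted || !newReplicas.contains(localBrokerId);
    }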

102. Should we include the metadata topic in the MetadataResponse? If so,
when will it be included and what will the metadata response look like?

103. "The active controller assigns the broker a new broker epoch, based on
the latest committed offset in the log." This seems inaccurate since the
latest committed offset doesn't always advance on every log append.

104. REGISTERING(1): It says "Otherwise, the broker moves into the FENCED
state." It seems this should be RUNNING?

105. RUNNING: Should we require the broker to catch up to the metadata log
to get into this state?

Thanks,

Jun



On Fri, Oct 23, 2020 at 1:20 PM Colin McCabe <cmcc...@apache.org> wrote:

> On Wed, Oct 21, 2020, at 05:51, Tom Bentley wrote:
> > Hi Colin,
> >
> > On Mon, Oct 19, 2020, at 08:59, Ron Dagostino wrote:
> > > > Hi Colin.  Thanks for the hard work on this KIP.
> > > >
> > > > I have some questions about what happens to a broker when it becomes
> > > > fenced (e.g. because it can't send a heartbeat request to keep its
> > > > lease).  The KIP says "When a broker is fenced, it cannot process any
> > > > client requests.  This prevents brokers which are not receiving
> > > > metadata updates or that are not receiving and processing them fast
> > > > enough from causing issues to clients." And in the description of the
> > > > FENCED(4) state it likewise says "While in this state, the broker does
> > > > not respond to client requests."  It makes sense that a fenced broker
> > > > should not accept producer requests -- I assume any such requests
> > > > would result in NotLeaderOrFollowerException.  But what about KIP-392
> > > > (fetch from follower) consumer requests?  It is conceivable that these
> > > > could continue.  Related to that, would a fenced broker continue to
> > > > fetch data for partitions where it thinks it is a follower?  Even if
> > > > it rejects consumer requests it might still continue to fetch as a
> > > > follower.  Might it be helpful to clarify both decisions here?
> > >
> > > Hi Ron,
> > >
> > > Good question.  I think a fenced broker should continue to fetch on
> > > partitions it was already fetching before it was fenced, unless it hits a
> > > problem.  At that point it won't be able to continue, since it doesn't have
> > > the new metadata.  For example, it won't know about leadership changes in
> > > the partitions it's fetching.  The rationale for continuing to fetch is to
> > > try to avoid disruptions as much as possible.
> > >
> > > I don't think fenced brokers should accept client requests.  The issue is
> > > that the fenced broker may or may not have any data it is supposed to
> > > have.  It may or may not have applied any configuration changes, etc. that
> > > it is supposed to have applied.  So it could get pretty confusing, and also
> > > potentially waste the client's time.
> > >
> > >
> > When fenced, how would the broker reply to a client which did make a
> > request?
> >
>
> Hi Tom,
>
> The broker will respond with a retryable error in that case.  Once the
> client has re-fetched its metadata, it will no longer see the fenced broker
> as part of the cluster.  I added a note to the KIP.
>
> best,
> Colin
>
> >
> > Thanks,
> >
> > Tom
> >
>
