Re: [DISCUSS] KIP-631: The Quorum-based Kafka Controller

Colin McCabe Fri, 25 Sep 2020 14:17:33 -0700

On Fri, Sep 25, 2020, at 10:17, Jun Rao wrote:
> Hi, Colin,
> 
> Thanks for the reply.
> 
> 60. Yes, I think you are right. We probably need the controller id when a
> broker starts up. A broker only stores the Raft leader id in the metadata
> file. To do the initial fetch to the Raft leader, it needs to know the
> host/port associated with the leader id.
> 
> 62. It seems there are 2 parts to this : (1) which listener a client should
> use to initiate a connection to the controller and (2) which listener
> should a controller use to accept client requests. For (1), at any point of
> time, a client only needs to use one listener. I think
> controller.listener.name is meant for the client.


Hi Jun,

controller.listener.names is also used by the controllers.  In the case where 
we have a broker and a controller in the same JVM, we have a single config 
file.  Then we need to know which listeners belong to the controller and which 
belong to the broker component.  That's why it's a list.

> So, a single value seems
> to make more sense. Currently, we don't have a configuration for (2). We
> could add a new one for that and support a list. I am wondering how useful
> it will be. One example that I can think of is that we can reject
> non-controller related requests if accepted on controller-only listeners.
> However, we already support separate authentication for the controller
> listener. So, not sure how useful it is.

The controller always has a separate listener and does not share listeners with 
the broker.  The main reason for this is to avoid giving clients the ability to 
launch a denial-of-service attack on the controller by flooding a broker port.  
A lot of times, these attacks are made unintentionally by poorly coded or 
configured clients.  Additionally, broker ports can also be very busy with 
large fetch requests, and so on.  It's just a bad configuration in general to 
try to overload the same port for the controller and broker, and we don't want 
to allow it.  It would be a regression to go from the current system where 
control requests are safely isolated on a separate port, to one where they're 
not.  It also makes the code and configuration a lot messier.

> 
> 63. (a) I think most users won't know controller.id defaults to broker.id +
> 3000. So, it can be confusing for them to set up controller.connect. If
> this is truly needed, it seems that it's less confusing to make
> controller.id required.
> (b) I am still trying to understand if we truly need to expose a
> controller.id. What if we only expose broker.id and let controller.connect
> just use broker.id? What will be missing?

The controller has a separate ID from the broker, so knowing broker.id is not 
helpful here.

best,
Colin

> 
> Thanks,
> 
> Jun
> 
> On Thu, Sep 24, 2020 at 10:55 PM Colin McCabe <[email protected]> wrote:
> 
> > On Thu, Sep 24, 2020, at 16:24, Jun Rao wrote:
> > > Hi, Colin,
> > >
> > > Thanks for the reply and the updated KIP. A few more comments below.
> > >
> >
> > Hi Jun,
> >
> > >
> > > 53. It seems that you already incorporated the changes in KIP-516. With
> > > topic ids, we don't need to wait for the topic's data to be deleted
> > before
> > > removing the topic metadata. If the topic is recreated, we can still
> > delete
> > > the data properly based on the topic id. So, it seems that we can remove
> > > TopicRecord.Deleting.
> > >
> >
> > Thanks for the reply.  What I was thinking of doing here was using topic
> > IDs internally, but still using names externally.  So the topic UUIDs are
> > only for the purpose of associating topics with partitions -- from the
> > user's point of view, topics are still identified by names.
> >
> > You're right that KIP-516 will simplify things, but I'm not sure when that
> > will land, so I wanted to avoid blocking the initial implementation of this
> > KIP on it.
> >
> > >
> > > 55. It seems to me that the current behavior where we favor the current
> > > broker registration is better. This is because uncontrolled broker
> > shutdown
> > > is rare and its impact is less severe since one just needs to wait for
> > the
> > > session timeout before restarting the broker. If a mis-configured broker
> > > replaces an existing broker, the consequence is more severe since it can
> > > cause the leader to be unavailable, a replica to be out of ISR, and add
> > > more load on the leaders etc.
> > >
> >
> > Hmm, that's a good point.  Let me check this a bit more before I change
> > it, though.
> >
> > >
> > > 60. [email protected]:9093,
> > > [email protected]:9093,[email protected] : Do we 
> > > need to
> > > include the controller id before @? It seems that the host/port is enough
> > > for establishing the connection. It would also be useful to make it clear
> > > that controller.connect replaces quorum.voters in KIP-595.
> > >
> >
> > I discussed this with Jason earlier, and he felt that the controller IDs
> > were needed in this configuration key.  It is certainly needed when
> > configuring the controllers themselves, since they need to know each
> > others' IDs.
> >
> > >
> > > 61. I am not sure that we need both controller.listeners and
> > > controller.connect.security.protocol since the former implies the
> > security
> > > protocol. The reason that we have both inter.broker.listener.name and
> > > inter.broker.security.protocol is because we had the latter first and
> > later
> > > realized that the former is more general.
> > >
> >
> > That's a good point.  I've removed this from the KIP.
> >
> > >
> > > 62. I am still not sure that you need controller.listeners to be a list.
> > > All listeners are already defined in listeners. To migrate from plaintext
> > > to SSL, one can configure listeners to have both plaintext and SSL. After
> > > that, one can just change controller.listeners from plaintext to SSL.
> > This
> > > is similar to how to change the listener for inter broker connections.
> > > Also, controller.listener.name may be a more accurate name?
> > >
> >
> > The issue that I see here is that if you are running with the controller
> > and broker in the same JVM, if you define a few listeners in "listeners"
> > they will get used as regular broker listeners, unless you put them in
> > controller.listeners.  Therefore, controller.listeners needs to be a list.
> >
> > controller.listener.names does sound like a better name, though... I've
> > updated it to that.
> >
> > >
> > > 63. Regarding controller.id, I am trying to understand whether it's a
> > > required configuration or an optional one. From the KIP, it sounds like
> > > controller.id is optional. Then, if this is unset, it's not clear how
> > the
> > > user will obtain the controller id for setting controller.connect.
> > >
> >
> > If you specify broker.id but not controller.id, then controller.id will
> > just be set to broker.id + 3000.  This was intended to make some
> > configurations a little bit more concise to write.  You still do have to
> > know the controller IDs when configuring the brokers, though.  If this
> > seems confusing then I can just make it mandatory.
> >
> > >
> > > 64. KIP-516 adds a flag in LeaderAndIsrRequest to indicate whether it
> > > includes all partitions or not. This will be used to clean up strayed
> > logs.
> > > I was thinking how we will do the same thing once the controller moves to
> > > Raft based. One way to do that is on broker startup, it gets the HWM in
> > the
> > > metadata log from the Raft leader and waits until its metadata catches up
> > > to HWM. At that point, any local log not seen in the metadata can be
> > > removed. Since the Fetch response returns the HWM, there seems to be
> > enough
> > > APIs to achieve this.
> > >
> >
> > That's a very good point.  I added a note about this under Broker Startup.
> >
> > best,
> > Colin
> >
> > >
> > > Jun
> > >
> > >
> > >
> > > On Thu, Sep 24, 2020 at 1:07 PM Colin McCabe <[email protected]> wrote:
> > >
> > > > On Mon, Sep 21, 2020, at 18:13, Jun Rao wrote:
> > > > > Hi, Colin,
> > > > >
> > > > > Sorry for the late reply. A few more comments below.
> > > > >
> > > >
> > > > Hi Jun,
> > > >
> > > > Thanks for taking another look.
> > > >
> > > > >
> > > > > 50. Configurations
> > > > > 50.1 controller.listeners: It seems that a controller just needs one
> > > > > listener. Why do we need to have a list here? Also, could you
> > provide an
> > > > > example of how this is set and what's its relationship with existing
> > > > > configs such as "security.inter.broker.protocol" and "
> > > > > inter.broker.listener.name"?
> > > > >
> > > >
> > > > I agree that most administrators will want to run with only one
> > controller
> > > > listener.  However, just as with brokers, it is nice to have the
> > option to
> > > > expose multiple ports.
> > > >
> > > > One reason why you might want multiple ports is if you were doing a
> > > > migration from plaintext to SSL.  You could add new SSL listeners but
> > > > continue to expose the PLAINTEXT listeners.  Then you could add the
> > new SSL
> > > > controller config to each broker and roll each broker.  Then you could
> > > > migrate from PLAINTEXT to SSL with no downtime.
> > > >
> > > > Here's an example configuration for the controller:
> > > > [email protected]:9093,
> > > > [email protected]:9093,[email protected]
> > > > controller.listeners=CONTROLLER
> > > > listeners=CONTROLLER://controller0.example.com:9093
> > > > listener.security.protocol.map=CONTROLLER:SSL
> > > >
> > > > Here's an example configuration for the broker:
> > > > [email protected]:9093,
> > > > [email protected]:9093,[email protected]
> > > > controller.connect.security.protocol=SSL
> > > >
> > > > security.inter.broker.protocol or inter.broker.listener.name do not
> > > > affect how the broker communicates with the controller.  Those
> > > > configurations relate to how brokers communicate with each other, but
> > the
> > > > controller is not a broker.
> > > >
> > > > (Note that I just added controller.connect.security.protocol to the
> > KIP --
> > > > I had forgotten to put it in earlier)
> > > >
> > > > >
> > > > > 50.2 registration.heartbeat.interval.ms and
> > > > registration.lease.timeout.ms.
> > > > > Should we match their default value with the corresponding default
> > for
> > > > ZK?
> > > > >
> > > >
> > > > Fair enough.  I'll set them to the values of the
> > zookeeper.sync.time.ms
> > > > and zookeeper.connection.timeout.ms configurations.  I do think we
> > should
> > > > experiment later on to see what works well here, but the ZK values are
> > at
> > > > least a starting point.
> > > >
> > > > >
> > > > > 50.3 controller.connect: Could you provide an example? I am
> > wondering how
> > > > > it differs from quorum.voters=1@kafka-1:9092, 2@kafka-2:9092,
> > 3@kafka-3
> > > > > :9092.
> > > > >
> > > >
> > > > controller.connect is intended to be the new name for quorum.voters.
> > > > During the vote for KIP-595, we sort of agreed to defer the discussion
> > > > about what this configuration should be called.  I proposed this new
> > name
> > > > because it makes it clear what the configuration is for (how to
> > connect to
> > > > the controllers).
> > > >
> > > > >
> > > > > 50.4 controller.id: I am still not sure how this is being used.
> > Could
> > > > you
> > > > > explain this in more detail?
> > > > >
> > > >
> > > > The controller ID needs to be set on each controller.  It also appears
> > in
> > > > controller.connect as the thing before the "at" sign.  Its function is
> > > > pretty similar to broker.id.
> > > >
> > > > Broker IDs and controller IDs exist in the same ID space, so you can't
> > > > have a broker and a controller that use the same ID.
> > > >
> > > > >
> > > > > 51. BrokerHeartbeat: It seems a bit wasteful to include Listeners in
> > > > every
> > > > > heartbeat request since it typically doesn't change. Could we make
> > that
> > > > an
> > > > > optional field?
> > > > >
> > > >
> > > > Ok.
> > > >
> > > > >
> > > > > 52. KIP-584 adds a new ZK node /features. Should we add a
> > corresponding
> > > > > metadata record?
> > > > >
> > > >
> > > > Good point.  I added FeatureLevelRecord.
> > > >
> > > > >
> > > > > 53. TopicRecord and DeleteTopic: Both DeleteTopic and
> > > > TopicRecord.Deleting
> > > > > indicate topic deletion. Could we outline the flow when each will be
> > set?
> > > > > In particular, which one indicates the intention to delete and which
> > one
> > > > > indicates the completion of the deletion.
> > > > >
> > > >
> > > > TopicRecord.Deleting is set when we intend to delete the topic.
> > > >
> > > > DeleteTopic removes the topic completely.  I will rename DeleteTopic to
> > > > RemoveTopic to make this clearer.
> > > >
> > > > >
> > > > > 54. "The controller can generate a new broker epoch by using the
> > latest
> > > > log
> > > > > offset." Which offset is that? Is it the offset of the metadata
> > topic for
> > > > > the corresponding BrokerRecord? Is it guaranteed to be different on
> > each
> > > > > broker restart?
> > > > >
> > > >
> > > > Yes.... a new broker epoch implies that there has been a new record
> > > > created in the metadata log.  Therefore the last committed offset must
> > be
> > > > different.
> > > >
> > > > >
> > > > > 55. "Thereafter, it may lose subsequent conflicts if its broker
> > epoch is
> > > > > stale.  (See KIP-380 for some background on broker epoch.)  The
> > reason
> > > > for
> > > > > favoring new processes is to accommodate the common case where a
> > process
> > > > is
> > > > > killed with kill -9 and then restarted. " Are you saying that if
> > there is
> > > > > an active broker registered in the controller, a new broker heartbeat
> > > > > request with the INITIAL state will fence the current broker
> > session? Not
> > > > > sure about this. For example, this will allow a broker with
> > incorrectly
> > > > > configured broker id to kill an existing valid broker.
> > > > >
> > > >
> > > > Yes, a new broker with an incorrectly configured broker id could fence
> > an
> > > > existing valid broker.
> > > >
> > > > The goal of the conflict resolution process described here is to avoid
> > > > having two brokers go back and forth in claiming the broker id.
> > Choosing
> > > > the newest was just an easy way to do that.
> > > >
> > > > Choosing the oldest is another possibility, and more inline with what
> > > > ZooKeeper does today.  The problem with that is that it would not be
> > > > possible to re-create the broker if the old instance died suddenly,
> > except
> > > > by waiting for the full lease timeout.
> > > >
> > > > It might be possible to do a little better here by breaking the lease
> > if
> > > > the TCP connection from the broker drops.  But that requires crossing
> > some
> > > > layers of abstraction in the Kafka networking stack.  There's also the
> > > > possibility of false negatives or positives.  For example, TCP
> > connections
> > > > sometimes just drop even if the node is still there.  Or sometimes the
> > > > networking stack keeps a TCP connection alive for a while when the
> > other
> > > > end is gone.
> > > >
> > > > >
> > > > > 56. kafka-storage.sh:
> > > > > 56.1 In the info mode, what other information does it show in
> > addition to
> > > > > "kip.500.mode=enabled"?
> > > > >
> > > >
> > > > I think it should show a list of all the storage directories, and
> > whether
> > > > each one is formatted.  In addition, it should show whether
> > kip.500.mode is
> > > > enabled, and what the cluster id is.
> > > >
> > > > >
> > > > > 56.2 Should the format mode take the config file as the input too
> > like
> > > > the
> > > > > info mode?
> > > > >
> > > >
> > > > Yes, that's a good idea.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Sep 17, 2020 at 4:12 PM Colin McCabe <[email protected]>
> > wrote:
> > > > >
> > > > > > Hi Unmesh,
> > > > > >
> > > > > > That's a fair point.  I have moved the lease duration to the broker
> > > > > > heartbeat response.  That way lease durations can be changed just
> > be
> > > > > > reconfiguring the controllers.
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > > >
> > > > > > On Wed, Sep 16, 2020, at 07:40, Unmesh Joshi wrote:
> > > > > > > Thanks Colin, the changes look good to me. One small thing.
> > > > > > > registration.lease.timeout.ms is the configuration on the
> > controller
> > > > > > side.
> > > > > > > It will be good to comment on how brokers know about it, to be
> > able
> > > > to
> > > > > > > send LeaseDurationMs
> > > > > > > in the heartbeat request,
> > > > > > > or else it can be added in the heartbeat response for brokers to
> > know
> > > > > > about
> > > > > > > it.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Unmesh
> > > > > > >
> > > > > > > On Fri, Sep 11, 2020 at 10:32 PM Colin McCabe <
> > [email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Unmesh,
> > > > > > > >
> > > > > > > > I think you're right that we should use a duration here rather
> > > > than a
> > > > > > > > time.  As you said, the clock on the controller will probably
> > not
> > > > > > match the
> > > > > > > > one on the broker.  I have updated the KIP.
> > > > > > > >
> > > > > > > > > > It's important to keep in mind that messages may be
> > delayed in
> > > > the
> > > > > > > > > > network, or arrive out of order.  When this happens, we
> > will
> > > > use
> > > > > > the
> > > > > > > > start
> > > > > > > > > > time specified in the request to determine if the request
> > is
> > > > stale.
> > > > > > > > > I am assuming that there will be a single TCP connection
> > > > maintained
> > > > > > > > between
> > > > > > > > > broker and active controller. So, there won't be any out of
> > order
> > > > > > > > requests?
> > > > > > > > > There will be a scenario of broker GC pause, which might
> > cause
> > > > > > connection
> > > > > > > > > timeout and broker might need to reestablish the connection.
> > If
> > > > the
> > > > > > pause
> > > > > > > > > is too long, lease will expire and the heartbeat sent after
> > the
> > > > pause
> > > > > > > > will
> > > > > > > > > be treated as a new registration (similar to restart case),
> > and
> > > > a new
> > > > > > > > epoch
> > > > > > > > > number will be assigned to the broker.
> > > > > > > >
> > > > > > > > I agree with the end of this paragraph, but not with the start
> > :)
> > > > > > > >
> > > > > > > > There can be out-of-order requests, since the broker will
> > simply
> > > > use a
> > > > > > new
> > > > > > > > TCP connection if the old one has problems.  This can happen
> > for a
> > > > > > variety
> > > > > > > > of reasons.  I don't think GC pauses are the most common
> > reason for
> > > > > > this to
> > > > > > > > happen.  It's more common to see issues issues in the network
> > > > itself
> > > > > > that
> > > > > > > > result connections getting dropped from time to time.
> > > > > > > >
> > > > > > > > So we have to assume that messages may arrive out of order, and
> > > > > > possibly
> > > > > > > > be delayed.  I added a note that heartbeat requests should be
> > > > ignored
> > > > > > if
> > > > > > > > the metadata log offset they contain is smaller than a previous
> > > > > > heartbeat.
> > > > > > > >
> > > > > > > > > When the active controller fails, the new active controller
> > > > needs to
> > > > > > be
> > > > > > > > > sure that it considers all the known brokers as alive till
> > the
> > > > lease
> > > > > > > > > expiration interval.  Because registration.lease.timeout.ms,
> > is
> > > > > > > > configured
> > > > > > > > > on the controller, the new active controller will extend all
> > the
> > > > > > leases
> > > > > > > > by
> > > > > > > > > registration.lease.timeout.ms. I see that it won't need last
> > > > > > heartbeat
> > > > > > > > > time.
> > > > > > > >
> > > > > > > > Agreed.
> > > > > > > >
> > > > > > > > best,
> > > > > > > > Colin
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Unmesh
> > > > > > > > >
> > > > > > > > > On Sat, Sep 5, 2020 at 1:28 AM Colin McCabe <
> > [email protected]>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > > Colin wrote:
> > > > > > > > > > > > The reason for including LeaseStartTimeMs in the
> > request
> > > > is to
> > > > > > > > ensure
> > > > > > > > > > > > that the time required to communicate with the
> > controller
> > > > gets
> > > > > > > > > > included in
> > > > > > > > > > > > the lease time.  Since requests can potentially be
> > delayed
> > > > in
> > > > > > the
> > > > > > > > > > network
> > > > > > > > > > > > for a long time, this is important.
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 31, 2020, at 05:58, Unmesh Joshi wrote:
> > > > > > > > > > > The network time will be added anyway, because the lease
> > > > timer
> > > > > > on the
> > > > > > > > > > > active controller will start only after the heartbeat
> > request
> > > > > > > > reaches the
> > > > > > > > > > > server.
> > > > > > > > > >
> > > > > > > > > > Hi Unmesh,
> > > > > > > > > >
> > > > > > > > > > If the start time is not specified in the request, then the
> > > > network
> > > > > > > > time
> > > > > > > > > > is excluded from the heartbeat time.
> > > > > > > > > >
> > > > > > > > > > Here's an example:
> > > > > > > > > > Let's say broker A sends a heartbeat at time 100, and it
> > > > arrives
> > > > > > on the
> > > > > > > > > > controller at time 200, and the lease duration is 1000.
> > > > > > > > > >
> > > > > > > > > > The controller looks at the start time in the request,
> > which is
> > > > > > 100,
> > > > > > > > and
> > > > > > > > > > adds 1000 to it, getting 1100.  On the other hand, if start
> > > > time
> > > > > > is not
> > > > > > > > > > specified in the request, then the expiration will be at
> > time
> > > > 1200.
> > > > > > > > > > That is what I mean by "the network time is included in the
> > > > > > expiration
> > > > > > > > > > time."
> > > > > > > > > >
> > > > > > > > > > > And I think, some assumption about network round trip
> > time is
> > > > > > > > > > > needed anyway to decide on the frequency of the
> > heartbeat (
> > > > > > > > > > > registration.heartbeat.interval.ms), and lease timeout (
> > > > > > > > > > > registration.lease.timeout.ms). So I think just having a
> > > > > > leaseTTL
> > > > > > > > in the
> > > > > > > > > > > request is easier to understand and implement.
> > > > > > > > > >
> > > > > > > > > > It's important to keep in mind that messages may be
> > delayed in
> > > > the
> > > > > > > > > > network, or arrive out of order.  When this happens, we
> > will
> > > > use
> > > > > > the
> > > > > > > > start
> > > > > > > > > > time specified in the request to determine if the request
> > is
> > > > stale.
> > > > > > > > > >
> > > > > > > > > > > > Yes, I agree that the lease timeout on the controller
> > side
> > > > > > should
> > > > > > > > be
> > > > > > > > > > > > reset in the case of controller failover.  The
> > alternative
> > > > > > would
> > > > > > > > be to
> > > > > > > > > > > > track the lease as hard state rather than soft state,
> > but I
> > > > > > think
> > > > > > > > that
> > > > > > > > > > > > is not really needed, and would result in more log
> > entries.
> > > > > > > > > > > My interpretation of the mention of BrokerRecord in the
> > KIP
> > > > was
> > > > > > that
> > > > > > > > this
> > > > > > > > > > > record exists in the Raft log.
> > > > > > > > > >
> > > > > > > > > > BrokerRecord does exist in the Raft log, but does not
> > include
> > > > the
> > > > > > last
> > > > > > > > > > heartbeat time.
> > > > > > > > > >
> > > > > > > > > > > By soft state, do you mean the broker
> > > > > > > > > > > records exist only on the active leader and will not be
> > > > > > replicated
> > > > > > > > in the
> > > > > > > > > > > raft log? If the live brokers list is maintained only on
> > the
> > > > > > active
> > > > > > > > > > > controller (raft leader), then, in case of leader
> > failure,
> > > > there
> > > > > > > > will be
> > > > > > > > > > a
> > > > > > > > > > > window where the new leader does not know about the live
> > > > brokers,
> > > > > > > > till
> > > > > > > > > > the
> > > > > > > > > > > brokers establish the leases again.
> > > > > > > > > > > I think it will be safer to have leases as a hard state
> > > > managed
> > > > > > by
> > > > > > > > > > standard
> > > > > > > > > > > Raft replication.
> > > > > > > > > >
> > > > > > > > > > Leases are short, so the need to re-establish them after a
> > > > > > controller
> > > > > > > > > > failover doesn't seem like a big problem.  But this is
> > > > something
> > > > > > we can
> > > > > > > > > > tweak if it becomes an issue.  One option would be to have
> > a
> > > > > > separate
> > > > > > > > log
> > > > > > > > > > which is only used by the controller nodes for this (since,
> > > > after
> > > > > > all,
> > > > > > > > > > brokers don't care about registration renewals).
> > > > > > > > > >
> > > > > > > > > > > Or am I misunderstanding something? (I assume that with
> > soft
> > > > > > state,
> > > > > > > > you
> > > > > > > > > > > mean something like zookeeper local sessions
> > > > > > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-1147.)
> > > > > > > > > > >
> > > > > > > > > > > > Our code is single threaded as well.  I think it makes
> > > > sense
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > > controller, since otherwise locking becomes very messy.
> > > > I'm
> > > > > > not
> > > > > > > > sure I
> > > > > > > > > > > > understand your question about duplicate broker ID
> > > > detection,
> > > > > > > > though.
> > > > > > > > > > > There's a section in the KIP about this -- is there a
> > detail
> > > > we
> > > > > > > > should
> > > > > > > > > > add
> > > > > > > > > > > there?
> > > > > > > > > >
> > > > > > > > > > This is an implementation detail that doesn't need to be
> > in the
> > > > > > KIP.
> > > > > > > > > >
> > > > > > > > > > best,
> > > > > > > > > > Colin
> > > > > > > > > >
> > > > > > > > > > > I assumed broker leases are implemented as a hard state.
> > In
> > > > that
> > > > > > > > case, to
> > > > > > > > > > > check for broker id conflict, we need to check the broker
> > > > ids at
> > > > > > two
> > > > > > > > > > places
> > > > > > > > > > > 1. Pending broker registrations (which are yet to be
> > > > committed)
> > > > > > 2.
> > > > > > > > > > Already
> > > > > > > > > > > committed broker registrations.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Unmesh
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 31, 2020 at 5:42 PM Colin McCabe <
> > > > [email protected]
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Sat, Aug 29, 2020, at 01:12, Unmesh Joshi wrote:
> > > > > > > > > > > > > >>>Can you repeat your questions about broker leases?
> > > > > > > > > > > > >
> > > > > > > > > > > > > >>>>The LeaseStartTimeMs is expected to be the
> > broker's
> > > > > > > > > > > > > 'System.currentTimeMillis()' at the point of the
> > > > request. The
> > > > > > > > active
> > > > > > > > > > > > > controller will add its lease period to this in order
> > > > >>>>to
> > > > > > > > compute
> > > > > > > > > > the
> > > > > > > > > > > > > LeaseEndTimeMs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think the use of LeaseStartTimeMs and
> > LeaseEndTimeMs
> > > > in the
> > > > > > > > KIP is
> > > > > > > > > > a
> > > > > > > > > > > > > bit
> > > > > > > > > > > > > confusing.  Monotonic Clock (System.nanoTime) on the
> > > > active
> > > > > > > > > > controller
> > > > > > > > > > > > > should be used to track leases.
> > > > > > > > > > > > > (For example,
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1616https://github.com/etcd-io/etcd/pull/6888/commits/e7f4010ccaf28b6ce64fe514d25a4b2fa459d114
> > > > > > > > > > > > > )
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then we will not need LeaseStartTimeMs?
> > > > > > > > > > > > > Instead of LeaseStartTimeMs, can we call it
> > LeaseTTL? The
> > > > > > active
> > > > > > > > > > > > controller
> > > > > > > > > > > > > can then calculate LeaseEndTime = System.nanoTime() +
> > > > > > LeaseTTL.
> > > > > > > > > > > > > In this case we might just drop LeaseEndTimeMs from
> > the
> > > > > > > > response, as
> > > > > > > > > > the
> > > > > > > > > > > > > broker already knows about the TTL and can send
> > > > heartbeats at
> > > > > > > > some
> > > > > > > > > > > > fraction
> > > > > > > > > > > > > of TTL, say every TTL/4 milliseconds.(elapsed time
> > on the
> > > > > > broker
> > > > > > > > > > measured
> > > > > > > > > > > > > by System.nanoTime)
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > >
> > > > > > > > > > > > I agree that the monotonic clock is probably a better
> > idea
> > > > > > here.
> > > > > > > > It is
> > > > > > > > > > > > good to be robust against wall clock changes, although
> > I
> > > > think
> > > > > > a
> > > > > > > > > > cluster
> > > > > > > > > > > > which had them might suffer other issues.  I will
> > change
> > > > it to
> > > > > > > > specify
> > > > > > > > > > a
> > > > > > > > > > > > monotonic clock.
> > > > > > > > > > > >
> > > > > > > > > > > > The reason for including LeaseStartTimeMs in the
> > request
> > > > is to
> > > > > > > > ensure
> > > > > > > > > > that
> > > > > > > > > > > > the time required to communicate with the controller
> > gets
> > > > > > included
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > lease time.  Since requests can potentially be delayed
> > in
> > > > the
> > > > > > > > network
> > > > > > > > > > for a
> > > > > > > > > > > > long time, this is important.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have a prototype built to demonstrate this as
> > > > following:
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/Kip631Controller.scala
> > > > > > > > > > > > >
> > > > > > > > > > > > > The Kip631Controller itself depends on a Consensus
> > > > module, to
> > > > > > > > > > demonstrate
> > > > > > > > > > > > > how possible interactions with the consensus module
> > will
> > > > look
> > > > > > > > like
> > > > > > > > > > > > >  (The Consensus can be pluggable really, with an API
> > to
> > > > allow
> > > > > > > > reading
> > > > > > > > > > > > > replicated log upto HighWaterMark)
> > > > > > > > > > > > >
> > > > > > > > > > > > > It has an implementation of LeaseTracker
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/LeaderLeaseTracker.scala
> > > > > > > > > > > > > to demonstrate LeaseTracker's interaction with the
> > > > consensus
> > > > > > > > module.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The implementation has the following aspects:
> > > > > > > > > > > > > 1. The lease tracking happens only on the active
> > > > controller
> > > > > > (raft
> > > > > > > > > > > > > leader)
> > > > > > > > > > > > > 2. Once the lease expires, it needs to propose and
> > > > commit a
> > > > > > > > > > FenceBroker
> > > > > > > > > > > > > record for that lease.
> > > > > > > > > > > > > 3. In case of active controller failure, the lease
> > will
> > > > be
> > > > > > > > tracked by
> > > > > > > > > > > > > the
> > > > > > > > > > > > > newly raft leader. The new raft leader starts the
> > lease
> > > > timer
> > > > > > > > again,
> > > > > > > > > > (as
> > > > > > > > > > > > > implemented in onBecomingLeader method of
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/Kip631Controller.scala
> > > > > > > > > > > > > )
> > > > > > > > > > > > > in effect extending the lease by the time spent in
> > the
> > > > leader
> > > > > > > > > > election
> > > > > > > > > > > > > and
> > > > > > > > > > > > > whatever time was elapsed on the old leader.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, I agree that the lease timeout on the controller
> > side
> > > > > > should
> > > > > > > > be
> > > > > > > > > > reset
> > > > > > > > > > > > in the case of controller failover.  The alternative
> > would
> > > > be
> > > > > > to
> > > > > > > > track
> > > > > > > > > > the
> > > > > > > > > > > > lease as hard state rather than soft state, but I think
> > > > that
> > > > > > is not
> > > > > > > > > > really
> > > > > > > > > > > > needed, and would result in more log entries.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > There are working tests for this implementation here.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/test/scala/com/dist/simplekafka/kip500/Kip631ControllerTest.scala
> > > > > > > > > > > > > and an end to end test here
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/test/scala/com/dist/simplekafka/ProducerConsumerKIP500Test.scala
> > > > > > > > > > > > > <
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/test/scala/com/dist/simplekafka/kip500/Kip631ControllerTest.scala
> > > > > > > > > > > > >
> > > > > > > > > > > > > >>'m not sure what you mean by "de-duplication of the
> > > > > > broker."
> > > > > > > > Can
> > > > > > > > > > you
> > > > > > > > > > > > > give a little more context?
> > > > > > > > > > > > > Apologies for using the confusing term
> > deduplication. I
> > > > meant
> > > > > > > > broker
> > > > > > > > > > id
> > > > > > > > > > > > > conflict.
> > > > > > > > > > > > > As you can see in the prototype handleRequest of
> > > > > > KIP631Controller
> > > > > > > > > > > > > <
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/Kip631Controller.scala
> > > > > > > > > > > > >,
> > > > > > > > > > > > > the duplicate broker id needs to be detected before
> > the
> > > > > > > > BrokerRecord
> > > > > > > > > > is
> > > > > > > > > > > > > submitted to the raft module.
> > > > > > > > > > > > > Also as implemented in the prototype, the
> > > > KIP631Controller is
> > > > > > > > single
> > > > > > > > > > > > > threaded, handling requests one at a time. (an
> > example of
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://martinfowler.com/articles/patterns-of-distributed-systems/singular-update-queue.html
> > > > > > > > > > > > > )
> > > > > > > > > > > >
> > > > > > > > > > > > Our code is single threaded as well.  I think it makes
> > > > sense
> > > > > > for
> > > > > > > > the
> > > > > > > > > > > > controller, since otherwise locking becomes very messy.
> > > > I'm
> > > > > > not
> > > > > > > > sure I
> > > > > > > > > > > > understand your question about duplicate broker ID
> > > > detection,
> > > > > > > > though.
> > > > > > > > > > > > There's a section in the KIP about this -- is there a
> > > > detail we
> > > > > > > > should
> > > > > > > > > > add
> > > > > > > > > > > > there?
> > > > > > > > > > > >
> > > > > > > > > > > > best,
> > > > > > > > > > > > Colin
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Unmesh
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Aug 29, 2020 at 10:49 AM Colin McCabe <
> > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Aug 28, 2020, at 19:36, Unmesh Joshi wrote:
> > > > > > > > > > > > > > > Hi Colin,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There were a few of questions I had..
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the response.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Were my comments on the broker lease
> > > > implementation
> > > > > > (and
> > > > > > > > > > > > corresponding
> > > > > > > > > > > > > > > prototype) appropriate and do we need to change
> > the
> > > > KIP
> > > > > > > > > > > > > > > description accordingly?.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Can you repeat your questions about broker leases?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. How will broker epochs be generated? I am
> > > > assuming it
> > > > > > can
> > > > > > > > be
> > > > > > > > > > the
> > > > > > > > > > > > > > > committed log offset (like zxid?)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > There isn't any need to use a log offset.  We can
> > just
> > > > > > look at
> > > > > > > > an
> > > > > > > > > > > > > > in-memory hash table and see what the latest
> > number is,
> > > > > > and add
> > > > > > > > > > one, to
> > > > > > > > > > > > > > generate a new broker epoch.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. How will producer registration happen? I am
> > > > assuming
> > > > > > it
> > > > > > > > > > should be
> > > > > > > > > > > > > > > similar to broker registration, with a similar
> > way to
> > > > > > > > generate
> > > > > > > > > > > > producer
> > > > > > > > > > > > > > id.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For the EOS stuff, we will need a few new RPCs to
> > the
> > > > > > > > controller.
> > > > > > > > > > I
> > > > > > > > > > > > think
> > > > > > > > > > > > > > we should do that in a follow-on KIP, though, since
> > > > this
> > > > > > one is
> > > > > > > > > > already
> > > > > > > > > > > > > > pretty big.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. Because we expose Raft log to all the
> > brokers, any
> > > > > > > > > > de-duplication
> > > > > > > > > > > > of
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > broker needs to happen before the requests are
> > > > proposed
> > > > > > to
> > > > > > > > Raft.
> > > > > > > > > > For
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > the controller needs to be single threaded, and
> > > > should do
> > > > > > > > > > validation
> > > > > > > > > > > > > > > against the in-process or pending requests and
> > the
> > > > final
> > > > > > > > state. I
> > > > > > > > > > > > read a
> > > > > > > > > > > > > > > mention of this, in the responses in this
> > > > thread.Will it
> > > > > > be
> > > > > > > > > > useful to
> > > > > > > > > > > > > > > mention this in the KIP?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm not sure what you mean by "de-duplication of
> > the
> > > > > > broker."
> > > > > > > > Can
> > > > > > > > > > you
> > > > > > > > > > > > > > give a little more context?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > best,
> > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Unmesh
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sat, Aug 29, 2020 at 4:50 AM Colin McCabe <
> > > > > > > > [email protected]
> > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm thinking of calling a vote on KIP-631 on
> > > > Monday.
> > > > > > Let
> > > > > > > > me
> > > > > > > > > > know
> > > > > > > > > > > > if
> > > > > > > > > > > > > > > > there's any more comments I should address
> > before I
> > > > > > start
> > > > > > > > the
> > > > > > > > > > vote.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > cheers,
> > > > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Aug 11, 2020, at 05:39, Unmesh Joshi
> > wrote:
> > > > > > > > > > > > > > > > > >>Hi Unmesh,
> > > > > > > > > > > > > > > > > >>Thanks, I'll take a look.
> > > > > > > > > > > > > > > > > Thanks. I will be adding more to the
> > prototype
> > > > and
> > > > > > will
> > > > > > > > be
> > > > > > > > > > happy
> > > > > > > > > > > > to
> > > > > > > > > > > > > > help
> > > > > > > > > > > > > > > > > and collaborate.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Unmesh
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Aug 11, 2020 at 12:28 AM Colin
> > McCabe <
> > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Jose,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > That'a s good point that I hadn't
> > considered.
> > > > It's
> > > > > > > > > > probably
> > > > > > > > > > > > worth
> > > > > > > > > > > > > > > > having
> > > > > > > > > > > > > > > > > > a separate leader change message, as you
> > > > mentioned.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks, I'll take a look.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > best,
> > > > > > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Aug 7, 2020, at 11:56, Jose Garcia
> > > > Sancio
> > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Very cool prototype!
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Colin,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > The KIP proposes a record called
> > IsrChange
> > > > which
> > > > > > > > > > includes the
> > > > > > > > > > > > > > > > > > > partition, topic, isr, leader and leader
> > > > epoch.
> > > > > > > > During
> > > > > > > > > > normal
> > > > > > > > > > > > > > > > > > > operation ISR changes do not result in
> > leader
> > > > > > > > changes.
> > > > > > > > > > > > Similarly,
> > > > > > > > > > > > > > > > > > > leader changes do not necessarily
> > involve ISR
> > > > > > > > changes.
> > > > > > > > > > The
> > > > > > > > > > > > > > controller
> > > > > > > > > > > > > > > > > > > implementation that uses ZK modeled them
> > > > together
> > > > > > > > because
> > > > > > > > > > > > > > > > > > > 1. All of this information is stored in
> > one
> > > > > > znode.
> > > > > > > > > > > > > > > > > > > 2. ZK's optimistic lock requires that you
> > > > specify
> > > > > > > > the new
> > > > > > > > > > > > value
> > > > > > > > > > > > > > > > > > completely
> > > > > > > > > > > > > > > > > > > 3. The change to that znode was being
> > > > performed
> > > > > > by
> > > > > > > > both
> > > > > > > > > > the
> > > > > > > > > > > > > > > > controller
> > > > > > > > > > > > > > > > > > > and the leader.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > None of these reasons are true in
> > KIP-500.
> > > > Have
> > > > > > we
> > > > > > > > > > considered
> > > > > > > > > > > > > > having
> > > > > > > > > > > > > > > > > > > two different records? For example
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > 1. IsrChange record which includes topic,
> > > > > > partition,
> > > > > > > > isr
> > > > > > > > > > > > > > > > > > > 2. LeaderChange record which includes
> > topic,
> > > > > > > > partition,
> > > > > > > > > > > > leader
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > leader epoch.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I suspect that making this change will
> > also
> > > > > > require
> > > > > > > > > > changing
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > message AlterIsrRequest introduced in
> > > > KIP-497:
> > > > > > Add
> > > > > > > > > > > > inter-broker
> > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > alter ISR.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > > > > > -Jose
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-631: The Quorum-based Kafka Controller

Reply via email to