On Mon, Jul 13, 2020, at 11:08, Boyang Chen wrote:
> Hey Colin, some quick questions,
>
> 1. I looked around and didn't find a config for the broker heartbeat
> interval. Are we piggy-backing on some existing configs?
>
Good point.  I meant to add this, but I forgot.  I added
registration.heartbeat.interval.ms to the table.

> 2. We only mentioned that the lease time is 10X the heartbeat interval;
> could we also include why we chose this value?

I will add registration.lease.timeout.ms so that it can be set separately
from registration.heartbeat.interval.ms.  The choice of value is a balance
between not timing out brokers too soon and not keeping unavailable brokers
around too long.

best,
Colin

> On Mon, Jul 13, 2020 at 10:09 AM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hi Colin,
> >
> > Thanks for the proposal. A few initial comments/questions below:
> >
> > 1. I don't follow why we need a separate configuration for
> > `controller.listeners`. The current listener configuration already allows
> > users to specify multiple listeners, which lets them define internal
> > endpoints that are not exposed to clients. Can you explain what the new
> > configuration gives us that we don't already have?
> > 2. What is the advantage of creating a separate `controller.id` instead
> > of just using `broker.id`?
> > 3. It sounds like you are imagining a stop-the-world approach to
> > snapshotting, which is why we need the controller micromanaging snapshots
> > on all followers. Alternatives include fuzzy snapshots, which can be done
> > concurrently. If this has been rejected, can you add some detail about
> > why?
> > 4. More of a nit, but should `DeleteBrokerRecord` be
> > `ShutdownBrokerRecord`? The broker is just getting removed from ISRs, but
> > it would still be present in the replica set (I assume).
> >
> > Thanks,
> > Jason
> >
> > On Sun, Jul 12, 2020 at 12:24 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Unmesh,
> > >
> > > That's an interesting idea, but I think it would be best to strive for
> > > single metadata events that are complete in themselves, rather than
> > > trying to do something transactional or EOS-like. For example, we could
> > > have a create event that contains all the partitions to be created.
> > >
> > > best,
> > > Colin
> > >
> > > On Fri, Jul 10, 2020, at 04:12, Unmesh Joshi wrote:
> > > > I was thinking that we might need something like the multi-operation
> > > > <https://issues.apache.org/jira/browse/ZOOKEEPER-965> record in
> > > > ZooKeeper, to atomically create topic and partition records when this
> > > > multi record is committed. This way the metadata will always have the
> > > > TopicRecord and PartitionRecord together, and in no situation can we
> > > > have a TopicRecord without a PartitionRecord. Not sure if there are
> > > > other situations where multi-operation is needed.
> > > >
> > > > Thanks,
> > > > Unmesh
> > > >
> > > > On Fri, Jul 10, 2020 at 11:32 AM Colin McCabe <cmcc...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Unmesh,
> > > > >
> > > > > Yes, once the last stable offset has advanced, we would consider
> > > > > the topic creation to be done, and then we could return success to
> > > > > the client.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Thu, Jul 9, 2020, at 19:44, Unmesh Joshi wrote:
> > > > > > It still needs the HighWaterMark / LastStableOffset to be
> > > > > > advanced by two records? Something like the following?
> > > > > >
> > > > > >                         |                 |
> > > > > >  <------------------   |-----------------|  HighWaterMark
> > > > > >       Response         | PartitionRecord |
> > > > > >                        |-----------------|
> > > > > >                        | TopicRecord     |
> > > > > >  ------------------->  |-----------------|  Previous HighWaterMark
> > > > > >       CreateTopic      |                 |
> > > > > >                        |                 |
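A minimal sketch of the mechanism being discussed here -- all names
(MetadataLogSketch, appendBatch, awaitCommitted) are invented for
illustration and are not from the KIP. The TopicRecord and PartitionRecord
are appended together, and the create-topic response is only sent once the
last stable offset has advanced past the second record:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;
    import java.util.concurrent.CompletableFuture;

    class MetadataLogSketch {
        private final List<Object> records = new ArrayList<>();
        private final TreeMap<Long, CompletableFuture<Void>> waiters = new TreeMap<>();
        private long lastStableOffset = -1;

        // The active controller is the single writer, so appending both
        // records back-to-back needs no log-level synchronization
        // (Colin's point below).
        synchronized long appendBatch(List<Object> batch) {
            records.addAll(batch);
            return records.size() - 1;   // offset of the batch's last record
        }

        // Completes once the last stable offset reaches the given offset,
        // i.e. once both records of the topic creation are committed.
        synchronized CompletableFuture<Void> awaitCommitted(long offset) {
            if (offset <= lastStableOffset) return CompletableFuture.completedFuture(null);
            return waiters.computeIfAbsent(offset, o -> new CompletableFuture<>());
        }

        // Called as the Raft quorum advances the last stable offset.
        synchronized void advanceLastStableOffset(long offset) {
            lastStableOffset = offset;
            NavigableMap<Long, CompletableFuture<Void>> done = waiters.headMap(offset, true);
            done.values().forEach(f -> f.complete(null));
            done.clear();
        }
    }

Usage would then be along the lines of appendBatch(List.of(topicRecord,
partitionRecord)) followed by awaitCommitted(offset).thenRun(...) to answer
the client.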
> > > > > > On Fri, Jul 10, 2020 at 1:30 AM Colin McCabe <cmcc...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > On Thu, Jul 9, 2020, at 04:37, Unmesh Joshi wrote:
> > > > > > > > I see that, when a new topic is created, two metadata
> > > > > > > > records, a TopicRecord (just the name and id of the topic)
> > > > > > > > and a PartitionRecord (more like LeaderAndIsr, with the
> > > > > > > > leader id and replica ids for the partition), are created.
> > > > > > > > While creating the topic, log entries for both records need
> > > > > > > > to be committed in the Raft core. Will it need something
> > > > > > > > like a MultiOperationRecord in ZooKeeper? Then we can have a
> > > > > > > > single log entry with both records, and the create-topic
> > > > > > > > request can be fulfilled atomically when both records are
> > > > > > > > committed?
> > > > > > >
> > > > > > > Hi Unmesh,
> > > > > > >
> > > > > > > Since the active controller is the only node writing to the
> > > > > > > log, there is no need for any kind of synchronization or
> > > > > > > access control at the log level.
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Unmesh
> > > > > > > >
> > > > > > > > On Wed, Jul 8, 2020 at 6:57 AM Ron Dagostino
> > > > > > > > <rndg...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Colin.  Thanks for the KIP.  Here is some feedback and
> > > > > > > > > various questions.
> > > > > > > > >
> > > > > > > > > "*Controller processes will listen on a separate port from
> > > > > > > > > brokers. This will be true even when the broker and
> > > > > > > > > controller are co-located in the same JVM*". I assume it
> > > > > > > > > is possible that the port numbers could be the same when
> > > > > > > > > using separate JVMs (i.e. the broker uses port 9192 and
> > > > > > > > > the controller also uses port 9192). I think it would be
> > > > > > > > > clearer to state this along these lines: "Controller nodes
> > > > > > > > > will listen on a port, and the controller port must differ
> > > > > > > > > from any port that a broker in the same JVM is listening
> > > > > > > > > on. In other words, a controller and a broker node, when
> > > > > > > > > in the same JVM, do not share ports."
> > > > > > > > >
> > > > > > > > > I think the sentence "*In the realm of ACLs, this
> > > > > > > > > translates to controllers requiring CLUSTERACTION on
> > > > > > > > > CLUSTER for all operations*" is confusing. It feels to me
> > > > > > > > > that you can just delete it. Am I missing something here?
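For what it's worth, a sketch of how the two ports might look in a
combined-JVM configuration, assuming the controller.listeners name from the
KIP; the host name and port numbers are illustrative only:

    import java.util.Properties;

    public class CombinedNodeConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Broker-facing listener for clients and replication traffic.
            props.setProperty("listeners", "PLAINTEXT://broker1:9092");
            // Controller listener on a different port; in the same JVM the
            // two ports must differ, while separate JVMs could reuse the
            // same port number.
            props.setProperty("controller.listeners", "CONTROLLER://broker1:9093");
            props.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }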
> > > > > > > > > The KIP states "*The metadata will be stored in memory on
> > > > > > > > > all the active controllers.*"  Can there be multiple
> > > > > > > > > active controllers?  Should it instead read "The metadata
> > > > > > > > > will be stored in memory on all potential controllers."
> > > > > > > > > (or something like that)?
> > > > > > > > >
> > > > > > > > > KIP-595 states "*we have assumed the name
> > > > > > > > > __cluster_metadata for this topic, but this is not a
> > > > > > > > > formal part of this proposal*".  This KIP-631 states
> > > > > > > > > "*Metadata changes need to be persisted to the __metadata
> > > > > > > > > log before we propagate them to the other nodes in the
> > > > > > > > > cluster. This means waiting for the metadata log's last
> > > > > > > > > stable offset to advance to the offset of the change.*"
> > > > > > > > > Are we here formally defining "__metadata" as the topic
> > > > > > > > > name, and should these sentences refer to the "__metadata
> > > > > > > > > topic" rather than the "__metadata log"?  What are the
> > > > > > > > > "other nodes in the cluster" that are referred to?  These
> > > > > > > > > are not controller nodes but brokers, right?  If so, then
> > > > > > > > > should we say "before we propagate them to the brokers"?
> > > > > > > > > Technically we have a controller cluster and a broker
> > > > > > > > > cluster -- two separate clusters, correct?  (Even though
> > > > > > > > > we could potentially share JVMs and therefore require no
> > > > > > > > > additional processes.)  If the statement is referring to
> > > > > > > > > nodes in both clusters then maybe we should state "before
> > > > > > > > > we propagate them to the other nodes in the controller
> > > > > > > > > cluster or to brokers."
> > > > > > > > >
> > > > > > > > > "*The controller may have several of these uncommitted
> > > > > > > > > changes in flight at any given time. In essence, the
> > > > > > > > > controller's in-memory state is always a little bit in the
> > > > > > > > > future compared to the current state. This allows the
> > > > > > > > > controller to continue doing things while it waits for the
> > > > > > > > > previous changes to be committed to the Raft log.*"
> > > > > > > > > Should the three references above be to the active
> > > > > > > > > controller rather than just the controller?
> > > > > > > > >
> > > > > > > > > "*Therefore, the controller must not make this future
> > > > > > > > > state "visible" to the rest of the cluster until it has
> > > > > > > > > been made persistent – that is, until it becomes current
> > > > > > > > > state*".  Again I wonder if this should refer to the
> > > > > > > > > "active" controller, and indicate "anyone else" as opposed
> > > > > > > > > to "the rest of the cluster", since we are talking about 2
> > > > > > > > > clusters here?
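A rough sketch of the pattern that paragraph describes -- the active
controller's in-memory state runs ahead of the log, and readers only see
values whose offset has been committed; the class and method names here are
invented for illustration, not the KIP's:

    import java.util.NavigableMap;
    import java.util.TreeMap;

    class TimelineValueSketch<T> {
        private final TreeMap<Long, T> pending = new TreeMap<>(); // future state, keyed by log offset
        private T committedValue;                                 // the only state made visible

        // The active controller applies a change immediately, tagged with
        // the offset of the (not yet committed) record that carries it.
        void write(long offset, T value) { pending.put(offset, value); }

        // Everyone else only ever observes committed state.
        T read() { return committedValue; }

        // As the Raft log's last stable offset advances, pending writes at
        // or below it become the visible current state.
        void onCommit(long lastStableOffset) {
            NavigableMap<Long, T> nowCommitted = pending.headMap(lastStableOffset, true);
            if (!nowCommitted.isEmpty()) committedValue = nowCommitted.lastEntry().getValue();
            nowCommitted.clear();
        }
    }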
> > > > > > > > > "*When the active controller decides that it itself
> > > > > > > > > should create a snapshot, it will first try to give up the
> > > > > > > > > leadership of the Raft quorum.*"  Why?  Is it necessary to
> > > > > > > > > state this?  It seems like it might be an implementation
> > > > > > > > > detail rather than a necessary constraint/requirement that
> > > > > > > > > we declare publicly and would have to abide by.
> > > > > > > > >
> > > > > > > > > "*It will reject brokers whose metadata is too stale*".
> > > > > > > > > Why?  An example might be helpful here.
> > > > > > > > >
> > > > > > > > > "*it may lose subsequent conflicts if its broker epoch is
> > > > > > > > > stale*"  This is the first time a "broker epoch" is
> > > > > > > > > mentioned.  I am assuming it is the controller epoch
> > > > > > > > > communicated to it (if any).  It would be good to
> > > > > > > > > introduce it / explicitly state what it is before
> > > > > > > > > referring to it.
> > > > > > > > >
> > > > > > > > > Ron
> > > > > > > > >
> > > > > > > > > On Tue, Jul 7, 2020 at 6:48 PM Colin McCabe
> > > > > > > > > <cmcc...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I posted a KIP about how the quorum-based controller
> > > > > > > > > > envisioned in KIP-500 will work.  Please take a look
> > > > > > > > > > here: https://cwiki.apache.org/confluence/x/4RV4CQ
> > > > > > > > > >
> > > > > > > > > > best,
> > > > > > > > > > Colin
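Circling back to the configs at the top of the thread, a sketch of the
relationship Colin describes: the names registration.heartbeat.interval.ms
and registration.lease.timeout.ms are from his reply, the 10x ratio is the
default Boyang quoted, and everything else here is invented for
illustration:

    class BrokerLeaseSketch {
        static final long HEARTBEAT_INTERVAL_MS = 2_000;                  // registration.heartbeat.interval.ms (example value)
        static final long LEASE_TIMEOUT_MS = 10 * HEARTBEAT_INTERVAL_MS;  // registration.lease.timeout.ms (10x default)

        private long lastHeartbeatMs;

        void onHeartbeat(long nowMs) { lastHeartbeatMs = nowMs; }

        // The tradeoff Colin mentions: a shorter timeout fences
        // unavailable brokers sooner, but risks expiring brokers that are
        // merely slow; a longer timeout keeps dead brokers around longer.
        boolean leaseExpired(long nowMs) {
            return nowMs - lastHeartbeatMs > LEASE_TIMEOUT_MS;
        }
    }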