Hi Colin,

Thanks for the detailed KIP.

>> "As described in KIP-500, the controller will store its data in the
internal __kafka_metadata topic."

The KIP doesn't really explain the extent to which this is a normal topic.
I know you say in the Networking section that the only time a client should
contact a controller directly is for debugging, and that's fine; I'm just
wondering whether, and to what extent, the existing RPCs can be used to
interact with it.

* Is it present in a MetadataResponse returned to a client by a broker, or
must the request be sent to a controller?
* Presumably I don't ever reassign the replicas of its single partition
(the mechanism for increasing replication would be to spin up more
controllers)
* Can I configure it in any way using ALTER_CONFIGS?
* Can a consumer fetch from it? I think KIP-595 implies the answer to this
is "yes" (see the sketch just below for what I have in mind).


>> "Metadata changes need to be persisted to the __kafka_metadata log
before we propagate them to the other nodes in the cluster."

Do you mean before we propagate them to other *brokers* in the cluster,
since propagation to other members of the quorum is a prerequisite of the
advancement of the LSO?


>> "When the active controller decides that a standby controller should
start a snapshot,"

What will govern this? Is it simply a certain threshold of changes since
the last snapshot, or the size of the log accrued since the last snapshot?
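
To put it another way, I'm imagining the trigger is something like the check
below (entirely hypothetical names; this is only to show the two alternatives
I'm asking about):

    // Entirely hypothetical; just illustrating the two possible triggers.
    class SnapshotPolicy {
        private final long maxRecordsBetweenSnapshots;
        private final long maxBytesBetweenSnapshots;

        SnapshotPolicy(long maxRecordsBetweenSnapshots, long maxBytesBetweenSnapshots) {
            this.maxRecordsBetweenSnapshots = maxRecordsBetweenSnapshots;
            this.maxBytesBetweenSnapshots = maxBytesBetweenSnapshots;
        }

        // Should the active controller ask a standby to start a snapshot now?
        boolean shouldSnapshot(long recordsSinceLastSnapshot, long bytesSinceLastSnapshot) {
            return recordsSinceLastSnapshot >= maxRecordsBetweenSnapshots
                || bytesSinceLastSnapshot >= maxBytesBetweenSnapshots;
        }
    }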


>> "When the active controller decides that it itself should create a
snapshot, it will first try to give up the leadership of the Raft quorum."

I know Ron asked this too, but it's not clear to me why this is necessary
(or even desirable). Also, what happens if the attempt to give up leadership
fails? Does the leader snapshot anyway?


>> "When the broker accepts the registration, it grants or renews a broker
ID lease associating the broker process with its ID.  Leases are
time-bounded."

Order of magnitude, how long would a lease typically be? I assume it's some
small multiple of the heartbeat interval.


>> "When a broker first starts up, before it has received any responses
from the controller, it will always "win" broker ID conflicts.  "

I find this sentence confusing because it's written from the perspective of
the broker, but I don't see how there can be any contest that the broker can
win before it has received an initial response from the active controller. So
I think this contest must happen on the active controller, but then how can
it know whether the broker has received a response? All the active
controller can know is one of:

* No response sent yet
* Response sent, but no evidence of receipt
* Receipt of response inferred from subsequent request (perhaps with
BrokerEpoch != -1 ?)

So I think a slightly clearer explanation would help.
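
For concreteness, this is roughly what I'm picturing on the active controller
side (entirely my own interpretation, with invented names, not anything taken
from the KIP):

    // My interpretation only; the names here are invented for illustration.
    public class BrokerIdConflicts {
        enum RegistrationState {
            NO_RESPONSE_SENT,          // controller has not yet replied for this broker id
            RESPONSE_SENT_UNCONFIRMED, // reply sent, but no evidence the broker received it
            RECEIPT_INFERRED           // a later request (e.g. BrokerEpoch != -1) implies receipt
        }

        // Does a new registration claiming an already-registered broker id "win"?
        static boolean newClaimantWins(RegistrationState existingClaimant) {
            switch (existingClaimant) {
                case NO_RESPONSE_SENT:
                    return true;   // is this what "always win" refers to?
                case RESPONSE_SENT_UNCONFIRMED:
                    return true;   // or does the outcome change here?
                case RECEIPT_INFERRED:
                    return false;  // the existing broker holds a live epoch, so it keeps the id
                default:
                    throw new IllegalStateException();
            }
        }
    }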


>> The Broker State Machine

In the diagram the edge connecting SHUTDOWN and FENCED lacks an arrow (I
assume there should be one at the FENCED end?)


>> controller.connect

The description says *brokers* use this at startup. Does that mean that
controllers have some other configuration for connecting to each other?


>> BrokerHeartbeatRequest.LeaseStartTimeMs

I think a sentence to say that this is simply the broker's
`System.currentTimeMillis()` at the point of request (to which the active
controller adds some lease period to compute the LeaseEndTimeMs) would be
helpful to avoid people thinking about brokers trying to request leases to
start in the future (relative to their clock).
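
In other words, something along these lines (just my reading of the field;
the lease period here is invented):

    public class LeaseTimes {
        public static void main(String[] args) {
            // On the broker, when building the BrokerHeartbeatRequest:
            long leaseStartTimeMs = System.currentTimeMillis(); // "now" by the broker's clock, not a future time

            // On the active controller, when granting or renewing the lease
            // (period invented here; presumably some small multiple of the heartbeat interval):
            long leasePeriodMs = 18_000L;
            long leaseEndTimeMs = leaseStartTimeMs + leasePeriodMs;
            System.out.printf("lease valid from %d to %d%n", leaseStartTimeMs, leaseEndTimeMs);
        }
    }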


A couple of more general comments:

* While I understand how the addition of records to the log allows for the
creation of broker, topic, etc. metadata, I don't understand how the metadata
for an entity would be deleted. How does topic metadata get removed from
the log? I think the TopicRecord.Deleting field is about handling async
delete between the brokers rather than causing the metadata for that topic
to be removed from the current state (and the next snapshot).

* Are you expecting that the controller will still be single-threaded?


Many thanks,

Tom

On Thu, Jul 9, 2020 at 12:44 PM Unmesh Joshi <unmeshjo...@gmail.com> wrote:

> I see that, when a new topic is created, two metadata records, a
> TopicRecord (just the name and id of the topic) and a PartitionRecord (more
> like LeaderAndIsr, with leader id and replica ids for the partition) are
> created.
> While creating the topic, log entries for both the records need to be
> committed in RAFT core. Will it need something like a MultiOperationRecord
> in zookeeper. Then, we can have a single log entry with both the records,
> and  the create topic request can be fulfilled atomically when both the
> records are committed?
>
> Thanks,
> Unmesh
>
> On Wed, Jul 8, 2020 at 6:57 AM Ron Dagostino <rndg...@gmail.com> wrote:
>
> > HI Colin.  Thanks for the KIP.  Here is some feedback and various
> > questions.
> >
> > "*Controller processes will listen on a separate port from brokers.  This
> > will be true even when the broker and controller are co-located in the
> same
> > JVM*". I assume it is possible that the port numbers could be the same
> when
> > using separate JVMs (i.e. broker uses port 9192 and controller also uses
> > port 9192).  I think it would be clearer to state this along these
> > lines: "Controller
> > nodes will listen on a port, and the controller port must differ from any
> > port that a broker in the same JVM is listening on.  In other words, a
> > controller and a broker node, when in the same JVM, do not share ports"
> >
> > I think the sentence "*In the realm of ACLs, this translates to
> controllers
> > requiring CLUSTERACTION on CLUSTER for all operations*" is confusing.  It
> > feels to me that you can just delete it.  Am I missing something here?
> >
> > The KIP states "*The metadata will be stored in memory on all the active
> > controllers.*"  Can there be multiple active controllers?  Should it
> > instead read "The metadata will be stored in memory on all potential
> > controllers." (or something like that)?
> >
> > KIP-595 states "*we have assumed the name __cluster_metadata for this
> > topic, but this is not a formal part of this proposal*".  This KIP-631
> > states "*Metadata changes need to be persisted to the __metadata log
> before
> > we propagate them to the other nodes in the cluster.  This means waiting
> > for the metadata log's last stable offset to advance to the offset of the
> > change.*"  Are we here formally defining "__metadata" as the topic name,
> > and should these sentences refer to "__metadata topic" rather than
> > "__metadata log"?  What are the "other nodes in the cluster" that are
> > referred to?  These are not controller nodes but brokers, right?  If so,
> > then should we say "before we propagate them to the brokers"?
> Technically
> > we have a controller cluster and a broker cluster -- two separate
> clusters,
> > correct?  (Even though we could potentially share JVMs and therefore
> > require no additional processes.). If the statement is referring to nodes
> > in both clusters then maybe we should state "before we propagate them to
> > the other nodes in the controller cluster or to brokers."
> >
> > "*The controller may have several of these uncommitted changes in flight
> at
> > any given time.  In essence, the controller's in-memory state is always a
> > little bit in the future compared to the current state.  This allows the
> > controller to continue doing things while it waits for the previous
> changes
> > to be committed to the Raft log.*"  Should the three references above be
> to
> > the active controller rather than just the controller?
> >
> > "*Therefore, the controller must not make this future state "visible" to
> > the rest of the cluster until it has been made persistent – that is,
> until
> > it becomes current state*". Again I wonder if this should refer to
> "active"
> > controller, and indicate "anyone else" as opposed to "the rest of the
> > cluster" since we are talking about 2 clusters here?
> >
> > "*When the active controller decides that it itself should create a
> > snapshot, it will first try to give up the leadership of the Raft
> quorum.*"
> > Why?  Is it necessary to state this?  It seems like it might be an
> > implementation detail rather than a necessary constraint/requirement that
> > we declare publicly and would have to abide by.
> >
> > "*It will reject brokers whose metadata is too stale*". Why?  An example
> > might be helpful here.
> >
> > "*it may lose subsequent conflicts if its broker epoch is stale*" This is
> > the first time a "broker epoch" is mentioned.  I am assuming it is the
> > controller epoch communicated to it (if any).  It would be good to
> > introduce it/explicitly state what it is before referring to it.
> >
> > Ron
> >
> > On Tue, Jul 7, 2020 at 6:48 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > I posted a KIP about how the quorum-based controller envisioned in
> > KIP-500
> > > will work.  Please take a look here:
> > > https://cwiki.apache.org/confluence/x/4RV4CQ
> > >
> > > best,
> > > Colin
> > >
> >
>
