Let me add a little more color to my question about the `controller.id`. I have been assuming that each process would run a single Raft replication thread for the metadata quorum, just that its role might differ: some would be voters and some would be observers. To me, that suggests that `broker.id` and `controller.id` should be the same thing, but it sounds like you are trying hard to make the controller process an isolated thing, like ZooKeeper was. In that case, how would it work for brokers that have both the "broker" and "controller" roles? Would they replicate the metadata log twice? Or would the broker skip "broker" replication if the process happens to also be a controller?
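To make the model I'm assuming concrete, here's a rough sketch (all names are hypothetical, not from the KIP): the process replicates the metadata log exactly once, and only its Raft role changes with its configuration.

```python
# Hypothetical sketch (names invented, not from KIP-631): a combined-role
# node runs ONE replication thread for the metadata log; the only thing
# that differs is whether it participates in the quorum as a voter.

def raft_role(process_roles, voter_ids, node_id):
    """Return the single Raft role this process plays for the metadata log."""
    if "controller" in process_roles and node_id in voter_ids:
        return "voter"       # participates in leader election and commit
    return "observer"        # replicates the log but never votes

# A node with both roles still replicates the log exactly once:
assert raft_role({"broker", "controller"}, voter_ids={0, 1, 2}, node_id=1) == "voter"
assert raft_role({"broker"}, voter_ids={0, 1, 2}, node_id=5) == "observer"
```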
At a higher level, I'm not 100% sold on the hard lines you are trying to draw between the broker and controller roles. It seems simpler to me from an operational perspective if we can keep the configuration of all processes as homogeneous as possible, perhaps with the only difference being `process.roles`. It would help if you could motivate the need for this separation a bit better and explain how it will affect operations.

-Jason

On Mon, Jul 13, 2020 at 11:08 AM Boyang Chen <reluctanthero...@gmail.com> wrote:

> Hey Colin, some quick questions.
>
> 1. I looked around and didn't find a config for the broker heartbeat
> interval; are we piggybacking on some existing configs?
> 2. We only mention that the lease time is 10x the heartbeat interval;
> could we also include why we chose this value?
>
> On Mon, Jul 13, 2020 at 10:09 AM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hi Colin,
> >
> > Thanks for the proposal. A few initial comments/questions below:
> >
> > 1. I don't follow why we need a separate configuration for
> > `controller.listeners`. The current listener configuration already
> > allows users to specify multiple listeners, which allows them to define
> > internal endpoints that are not exposed to clients. Can you explain what
> > the new configuration gives us that we don't already have?
> > 2. What is the advantage of creating a separate `controller.id` instead
> > of just using `broker.id`?
> > 3. It sounds like you are imagining a stop-the-world approach to
> > snapshotting, which is why we need the controller micromanaging
> > snapshots on all followers. Alternatives include fuzzy snapshots, which
> > can be taken concurrently. If this has been rejected, can you add some
> > detail about why?
> > 4. More of a nit, but should `DeleteBrokerRecord` be
> > `ShutdownBrokerRecord`? The broker is just getting removed from ISRs,
> > but it would still be present in the replica set (I assume).
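For reference, the heartbeat/lease arithmetic Boyang asks about in his question 2 might look like the following sketch. The config names and the 2-second heartbeat are invented for illustration; only the 10x multiplier comes from the KIP discussion itself.

```python
# Hypothetical sketch of the heartbeat/lease relationship. Names and the
# heartbeat period are assumptions; only the 10x factor is from the thread.

HEARTBEAT_INTERVAL_MS = 2_000        # assumed broker heartbeat period
LEASE_MULTIPLIER = 10                # lease time = 10x the heartbeat interval
LEASE_DURATION_MS = HEARTBEAT_INTERVAL_MS * LEASE_MULTIPLIER  # 20,000 ms

def lease_expired(last_heartbeat_ms: int, now_ms: int) -> bool:
    """The controller would fence a broker once a full lease elapses
    with no heartbeat received."""
    return now_ms - last_heartbeat_ms > LEASE_DURATION_MS

# A broker can miss several heartbeats before losing its lease, which is
# presumably why the lease is a large multiple of the heartbeat interval:
assert not lease_expired(last_heartbeat_ms=0, now_ms=19_000)
assert lease_expired(last_heartbeat_ms=0, now_ms=20_001)
```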
> >
> > Thanks,
> > Jason
> >
> > On Sun, Jul 12, 2020 at 12:24 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Unmesh,
> > >
> > > That's an interesting idea, but I think it would be best to strive for
> > > single metadata events that are complete in themselves, rather than
> > > trying to do something transactional or EOS-like. For example, we
> > > could have a create event that contains all the partitions to be
> > > created.
> > >
> > > best,
> > > Colin
> > >
> > > On Fri, Jul 10, 2020, at 04:12, Unmesh Joshi wrote:
> > > > I was thinking that we might need something like the multi-operation
> > > > record in ZooKeeper
> > > > (<https://issues.apache.org/jira/browse/ZOOKEEPER-965>)
> > > > to atomically create topic and partition records when this
> > > > multi-record entry is committed. That way the metadata will always
> > > > have the TopicRecord and PartitionRecord together, and in no
> > > > situation can we have a TopicRecord without a PartitionRecord. I'm
> > > > not sure if there are other situations where a multi-operation is
> > > > needed.
> > > >
> > > > Thanks,
> > > > Unmesh
> > > >
> > > > On Fri, Jul 10, 2020 at 11:32 AM Colin McCabe <cmcc...@apache.org>
> > > > wrote:
> > > > >
> > > > > Hi Unmesh,
> > > > >
> > > > > Yes, once the last stable offset has advanced, we would consider
> > > > > the topic creation to be done, and then we could return success to
> > > > > the client.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Thu, Jul 9, 2020, at 19:44, Unmesh Joshi wrote:
> > > > > > It still needs HighWaterMark / LastStableOffset to be advanced
> > > > > > by two records? Something like the following?
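A rough sketch of the single-event approach Colin describes, where topic creation is written as one atomic batch of records rather than anything transactional. The record shapes below are illustrative only, not the KIP's exact schema:

```python
# Sketch of writing a topic creation as one atomic batch. Because only
# the active controller appends to the metadata log, the batch needs no
# extra locking. Field names here are invented for illustration.

def create_topic_records(name, topic_id, assignments):
    """Build the full set of records for one topic creation as a single batch."""
    batch = [{"type": "TopicRecord", "name": name, "id": topic_id}]
    for partition, replicas in assignments.items():
        batch.append({
            "type": "PartitionRecord",
            "topicId": topic_id,
            "partition": partition,
            "replicas": replicas,
            "leader": replicas[0],   # simplistic leader choice for the sketch
        })
    return batch  # appended to the log as one unit, so no partial topics

records = create_topic_records("foo", 7, {0: [1, 2, 3], 1: [2, 3, 1]})
assert [r["type"] for r in records] == ["TopicRecord", "PartitionRecord", "PartitionRecord"]
```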
> > > > > >
> > > > > >                          |                 |
> > > > > >   <-----------------    |-----------------|  <-- HighWaterMark
> > > > > >       Response          | PartitionRecord |
> > > > > >                         |-----------------|
> > > > > >                         | TopicRecord     |
> > > > > >   ----------------->    |-----------------|  <-- Previous HighWaterMark
> > > > > >      CreateTopic        |                 |
> > > > > >                         |                 |
> > > > > >
> > > > > > On Fri, Jul 10, 2020 at 1:30 AM Colin McCabe <cmcc...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > On Thu, Jul 9, 2020, at 04:37, Unmesh Joshi wrote:
> > > > > > > > I see that, when a new topic is created, two metadata records
> > > > > > > > are created: a TopicRecord (just the name and id of the
> > > > > > > > topic) and a PartitionRecord (more like LeaderAndIsr, with
> > > > > > > > the leader id and replica ids for the partition).
> > > > > > > > While creating the topic, log entries for both records need
> > > > > > > > to be committed in the Raft core. Will it need something like
> > > > > > > > a MultiOperationRecord in ZooKeeper? Then we could have a
> > > > > > > > single log entry with both records, and the create-topic
> > > > > > > > request could be fulfilled atomically when both records are
> > > > > > > > committed.
> > > > > > >
> > > > > > > Hi Unmesh,
> > > > > > >
> > > > > > > Since the active controller is the only node writing to the
> > > > > > > log, there is no need for any kind of synchronization or
> > > > > > > access control at the log level.
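A minimal sketch of the commit-then-respond flow Unmesh describes, where the CreateTopic response is sent only once the high watermark has advanced past both records. The offset semantics assumed here (records below the high watermark are committed) and all names are illustrative:

```python
# Sketch of gating the CreateTopic response on the high watermark.
# Assumed convention: the high watermark is the offset of the first
# uncommitted record, so offsets strictly below it are committed.

def request_complete(last_record_offset: int, high_watermark: int) -> bool:
    """True once every record written for the request is committed by the quorum."""
    return high_watermark > last_record_offset

# TopicRecord at offset 10, PartitionRecord at offset 11: the response
# waits until the watermark moves past offset 11.
assert not request_complete(last_record_offset=11, high_watermark=11)
assert request_complete(last_record_offset=11, high_watermark=12)
```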
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Unmesh
> > > > > > > >
> > > > > > > > On Wed, Jul 8, 2020 at 6:57 AM Ron Dagostino
> > > > > > > > <rndg...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Colin. Thanks for the KIP. Here is some feedback and
> > > > > > > > > various questions.
> > > > > > > > >
> > > > > > > > > "*Controller processes will listen on a separate port from
> > > > > > > > > brokers. This will be true even when the broker and
> > > > > > > > > controller are co-located in the same JVM*". I assume it is
> > > > > > > > > possible that the port numbers could be the same when using
> > > > > > > > > separate JVMs (i.e. the broker uses port 9192 and the
> > > > > > > > > controller also uses port 9192). I think it would be
> > > > > > > > > clearer to state this along these lines: "Controller nodes
> > > > > > > > > will listen on a port, and the controller port must differ
> > > > > > > > > from any port that a broker in the same JVM is listening
> > > > > > > > > on. In other words, a controller and a broker node, when in
> > > > > > > > > the same JVM, do not share ports."
> > > > > > > > >
> > > > > > > > > I think the sentence "*In the realm of ACLs, this
> > > > > > > > > translates to controllers requiring CLUSTERACTION on
> > > > > > > > > CLUSTER for all operations*" is confusing. It feels to me
> > > > > > > > > that you can just delete it. Am I missing something here?
> > > > > > > > >
> > > > > > > > > The KIP states "*The metadata will be stored in memory on
> > > > > > > > > all the active controllers.*" Can there be multiple active
> > > > > > > > > controllers?
> > > > > > > > > Should it instead read "The metadata will be stored in
> > > > > > > > > memory on all potential controllers." (or something like
> > > > > > > > > that)?
> > > > > > > > >
> > > > > > > > > KIP-595 states "*we have assumed the name
> > > > > > > > > __cluster_metadata for this topic, but this is not a formal
> > > > > > > > > part of this proposal*". This KIP-631 states "*Metadata
> > > > > > > > > changes need to be persisted to the __metadata log before
> > > > > > > > > we propagate them to the other nodes in the cluster. This
> > > > > > > > > means waiting for the metadata log's last stable offset to
> > > > > > > > > advance to the offset of the change.*" Are we here formally
> > > > > > > > > defining "__metadata" as the topic name, and should these
> > > > > > > > > sentences refer to the "__metadata topic" rather than the
> > > > > > > > > "__metadata log"? What are the "other nodes in the cluster"
> > > > > > > > > that are referred to? These are not controller nodes but
> > > > > > > > > brokers, right? If so, then should we say "before we
> > > > > > > > > propagate them to the brokers"? Technically we have a
> > > > > > > > > controller cluster and a broker cluster -- two separate
> > > > > > > > > clusters, correct? (Even though we could potentially share
> > > > > > > > > JVMs and therefore require no additional processes.) If the
> > > > > > > > > statement is referring to nodes in both clusters, then
> > > > > > > > > maybe we should state "before we propagate them to the
> > > > > > > > > other nodes in the controller cluster or to brokers."
> > > > > > > > >
> > > > > > > > > "*The controller may have several of these uncommitted
> > > > > > > > > changes in flight at any given time.
> > > > > > > > > In essence, the controller's in-memory state is always a
> > > > > > > > > little bit in the future compared to the current state.
> > > > > > > > > This allows the controller to continue doing things while
> > > > > > > > > it waits for the previous changes to be committed to the
> > > > > > > > > Raft log.*" Should the three references above be to the
> > > > > > > > > active controller rather than just the controller?
> > > > > > > > >
> > > > > > > > > "*Therefore, the controller must not make this future state
> > > > > > > > > "visible" to the rest of the cluster until it has been made
> > > > > > > > > persistent – that is, until it becomes current state*".
> > > > > > > > > Again I wonder if this should refer to the "active"
> > > > > > > > > controller, and indicate "anyone else" as opposed to "the
> > > > > > > > > rest of the cluster", since we are talking about 2 clusters
> > > > > > > > > here?
> > > > > > > > >
> > > > > > > > > "*When the active controller decides that it itself should
> > > > > > > > > create a snapshot, it will first try to give up the
> > > > > > > > > leadership of the Raft quorum.*" Why? Is it necessary to
> > > > > > > > > state this? It seems like it might be an implementation
> > > > > > > > > detail rather than a necessary constraint/requirement that
> > > > > > > > > we declare publicly and would have to abide by.
> > > > > > > > >
> > > > > > > > > "*It will reject brokers whose metadata is too stale*".
> > > > > > > > > Why? An example might be helpful here.
> > > > > > > > >
> > > > > > > > > "*it may lose subsequent conflicts if its broker epoch is
> > > > > > > > > stale*" This is the first time a "broker epoch" is
> > > > > > > > > mentioned.
> > > > > > > > > I am assuming it is the controller epoch communicated to
> > > > > > > > > it (if any). It would be good to introduce it / explicitly
> > > > > > > > > state what it is before referring to it.
> > > > > > > > >
> > > > > > > > > Ron
> > > > > > > > >
> > > > > > > > > On Tue, Jul 7, 2020 at 6:48 PM Colin McCabe
> > > > > > > > > <cmcc...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I posted a KIP about how the quorum-based controller
> > > > > > > > > > envisioned in KIP-500 will work. Please take a look here:
> > > > > > > > > > https://cwiki.apache.org/confluence/x/4RV4CQ
> > > > > > > > > >
> > > > > > > > > > best,
> > > > > > > > > > Colin
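As an aside on Ron's broker-epoch question: epoch-based conflict resolution between competing registrations for the same broker id is commonly sketched like this. The field names and the tie-breaking rule below are assumptions for illustration, not the KIP's final design:

```python
# Sketch of resolving conflicting registrations for one broker id:
# the registration carrying the higher epoch wins, so a broker with a
# stale epoch "loses subsequent conflicts". Names are hypothetical.

def resolve_registration(existing, incoming):
    """Keep whichever registration for a broker id carries the higher epoch."""
    if incoming["epoch"] > existing["epoch"]:
        return incoming      # existing registration is stale and loses
    return existing          # incoming claim has a stale epoch and loses

old = {"broker_id": 3, "epoch": 41}
new = {"broker_id": 3, "epoch": 42}
assert resolve_registration(old, new) is new   # newer epoch wins the conflict
```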