Hi Jose. Thanks for the KIP. Here are some questions and some nit corrections.
<<< In KIP-500 the Kafka Controller, which is the quorum leader from KIP-595, will materialize the entries in the metadata log into memory.

Technically I think the quorum leader is referred to as the Active Controller in KIP-500. Maybe replace "Kafka Controller" with "Active Controller"? I think the term "Kafka Controller" is fine as used throughout the rest of the KIP to refer to the entire thing, but when referring specifically to the leader I think "Active Controller" is the term that is defined in KIP-500.

<<< Each broker in KIP-500, which will be a replica of the metadata log, will materialize the entries in the log into a metadata cache

This wording confused me because I assumed that "replica" was a formal term and only (non-Active) Controllers are formally "replicas" of the metadata log -- Kafka brokers would be clients that read the log and then use the data for their own purpose as opposed to formally being replicas with this understanding of the term "replica". Is that correct, and if so, maybe replace "replica" with "client"?

<<< The type of in-memory state machines what we plan to implement
>>> The type of in-memory state machines that we plan to implement

nit

<<< doesn't map very well to an key and offset based clean up policy.
>>> doesn't map very well to a key and offset based clean up policy.

nit

<<< When starting a broker either because it is a new broker, a broker was upgraded or a failed broker is restarting. Loading the state represented by the __cluster_metadata topic partition is required before the broker is available
>>> When starting a broker either because it is a new broker or it is
>>> restarting, loading the state represented by the __cluster_metadata topic
>>> partition is required before the broker is available.

Reword for simplicity and clarity?

<<< With snapshot based of the in-memory state Kafka can be much more aggressive
>>> By taking and transmitting a snapshot of the in-memory state as described
>>> below Kafka can be much more aggressive

Tough to refer to the concept of a snapshot here without having described what it is, so refer to "as described below" to help orient the reader?

<<< In the future this mechanism will also be useful for high-throughput topic partitions like the Group Coordinator and Transaction Coordinator.
>>> In the future this mechanism may also be useful for high-throughput topic
>>> partitions like the Group Coordinator and Transaction Coordinator.

Tough to say "will" when that is an assumption that would depend on a future KIP?

<<< If events/deltas are not allowed in the replicated log for the __cluster_metadata topic partition then the ... Kafka Controller will need to replicate 3.81 MB to each broker in the cluster (10) or 38.14 MB.

It might be good to append a sentence that explicitly states how much data is replicated for the delta/event -- right now it is implied to be very small, but that's kind of like leaving the punch line to a joke implied :-)

<<< Follower and observer replicas fetch the snapshots from the leader they attempt to fetch an offset from the leader and the leader doesn’t have that offset in the log
>>> Follower and observer replicas fetch a snapshot from the leader when they
>>> attempt to fetch an offset from the leader and the leader doesn’t have that
>>> offset in the log

nit
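To make sure I have the right mental model for that last sentence (and for the fetch request handling described further down), here is a rough sketch of how I am picturing the leader-side check. This is just my own illustration -- the types, fields and method names below are made up and simplified, not anything defined in the KIP:

import java.util.OptionalLong;

public class LeaderFetchSketch {

    // Hypothetical, simplified stand-ins for the real schemas.
    record FetchRequest(long fetchOffset) {}
    record SnapshotOffsetAndEpoch(long offset, int epoch) {}
    record FetchResponse(OptionalLong snapshotOffset, int snapshotEpoch, byte[] records) {}

    private final long logBeginOffset;                   // LBO
    private final SnapshotOffsetAndEpoch latestSnapshot; // most recent snapshot on the leader

    LeaderFetchSketch(long logBeginOffset, SnapshotOffsetAndEpoch latestSnapshot) {
        this.logBeginOffset = logBeginOffset;
        this.latestSnapshot = latestSnapshot;
    }

    FetchResponse handleFetch(FetchRequest request) {
        if (request.fetchOffset() < logBeginOffset) {
            // The requested offset has already been truncated away, so instead
            // of records the leader tells the follower/observer which snapshot
            // to fetch.
            return new FetchResponse(
                OptionalLong.of(latestSnapshot.offset()),
                latestSnapshot.epoch(),
                new byte[0]);
        }
        // Otherwise the normal fetch path serves records from the log.
        return new FetchResponse(OptionalLong.empty(), -1, readRecordsFrom(request.fetchOffset()));
    }

    private byte[] readRecordsFrom(long offset) {
        return new byte[0]; // placeholder for the existing read path
    }
}

If that is roughly what is intended, it also makes it clearer to me how the latest snapshot and the LBO relate, which ties into my LBO questions further down.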
<<< Generating and loading the snapshot will be delegated to the Kafka
<<< Controller.
<<< The Kafka Controller will notify the Kafka Raft client when it has
<<< generated a snapshot and up to which offset is included in the snapshot.
<<< The Kafka Raft client will notify the Kafka Controller when a new snapshot
<<< has been fetched from the leader.

This paragraph confuses me. What is the "Kafka Raft client" -- is this the broker? Or is it some other subsystem (or all other subsystems aside from log replication) within the Controller? Has this been defined somewhere? If so it would be good to refer to that definition. (Actually, now that I've read further down, I think you refer to this as "Kafka Raft" later in the KIP; a reference to these later sections, or naming it "Kafka Raft Client" later on, would have helped me avoid confusion -- I searched the doc for "raft client" rather than "kafka raft", so I missed this when I searched.)

<<< The Kafka Controller will notify the Kafka Raft client when it has finished generating a new snapshots.

Same comment about "Kafka Raft client".

<<< It is safe for the log to truncate a prefix of the log up to the latest snapshot.

"log to truncate a prefix of the log" -- That first mention of "log" needs to be something else, I assume -- LogManager maybe?

<<< In the example above, if the Kafka topic partition leader receives a fetch request with an offset and epoch greater than or equal to the log begin offset (x, a)
>>> In the example above, offset=x, epoch=a does not appear in the diagram
>>> because it is before the log begin offset (x+1, b). If the Kafka topic
>>> partition leader receives a fetch request with an offset and epoch greater
>>> than or equal to the LBO

Maybe add an explicit comment that offset=x, epoch=a does not appear in the diagram because it is before the LBO of (x+1, b)? Also need to fix the LBO reference (currently incorrectly stated as (x, a)).

<<< LBO can be increase/set to an offset X if the following is true:
<<< 2. One of the following is true:

Do the pair of conditions in (2) only apply to the leader/Active Controller?

<<< The broker will delete any snapshot with a latest offset and epoch (SnapshotOffsetAndEpoch) less than the LBO - 1 (log begin offset).

Is it also a condition that there must be a snapshot in existence "greater than" (in terms of offset, epoch) the one to be deleted?

<<< There are two cases when the Kafka Controller needs to load the snapshot:
<<< 1. When the broker is booting.
>>> 1. When the Controller is booting. (Or "When it is booting")

<<< 2. When the follower and observer replicas finishes fetches a new snapshot from the leader
>>> 2. When a follower or observer replica finishes fetching a new snapshot from the leader

grammar

<<< The replica’s handling of fetch request will be extended such that if FetchOffset is less than LogBeginOffset then the leader will respond with
>>> The leader's handling of fetch request will be extended such that if
>>> FetchOffset is less than LogBeginOffset then it will respond with

clarifying corrections

<<< The replica's handling of fetch response will be extended such that if SnapshotOffsetAndEpoch is set
<<< then the follower will truncate its entire log and start fetching from both the log and the snapshot.

Does it logically truncate or physically truncate, and does it make a difference which one it is? Logical deletion is "softer" than a hard, physical delete and may be less risky if there were to be a catastrophe (i.e. we never want to lose any data, but if we ever get into that situation then might it be best to lose less data?)

<<< If Continue is true, the follower is expected to send another FetchSnapshotRequest with a Position set to the response's Position plus the response's len(Bytes).

Is there a cleanup mechanism necessary for the case where a partial snapshot is downloaded but the download never completes -- perhaps if the leader dies and doesn't come back for a long time? What would a follower do in that case, and how does cleanup work?
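And to double check my understanding of the follower side of this exchange, the flow I have in mind is sketched below. Again, the types and names are made up for illustration (e.g. the Continue field is renamed shouldContinue because "continue" is a Java keyword), so please treat it as my reading of the KIP rather than anything the KIP actually specifies:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class FollowerSnapshotFetchSketch {

    // Hypothetical, simplified stand-ins for the real schemas.
    record SnapshotOffsetAndEpoch(long offset, int epoch) {}
    record FetchSnapshotResponse(long position, byte[] bytes, boolean shouldContinue) {}

    interface Leader {
        FetchSnapshotResponse fetchSnapshot(SnapshotOffsetAndEpoch snapshotId, long position);
    }

    interface ReplicatedLog {
        // Whole-log truncation, as I read the "truncate its entire log" sentence.
        void truncateFullyAndStartAt(long newStartOffset);
    }

    byte[] fetchSnapshotAfterTruncation(Leader leader, ReplicatedLog log,
                                        SnapshotOffsetAndEpoch snapshotId) throws IOException {
        // 1. Truncate the entire local log (this is where my logical-vs-physical
        //    question applies); I am guessing at the exact new start offset here.
        log.truncateFullyAndStartAt(snapshotId.offset() + 1);

        // 2. Fetch the snapshot in chunks, advancing Position until Continue is false.
        ByteArrayOutputStream snapshot = new ByteArrayOutputStream();
        long position = 0;
        while (true) {
            FetchSnapshotResponse response = leader.fetchSnapshot(snapshotId, position);
            snapshot.write(response.bytes());
            if (!response.shouldContinue()) {
                break;
            }
            // Next request starts where this chunk ended:
            // the response's Position plus the response's len(Bytes).
            position = response.position() + response.bytes().length;
        }
        return snapshot.toByteArray();
    }
}

If that loop is the right mental model, then my cleanup question above is really about what happens to the partially accumulated snapshot bytes when the loop never reaches the Continue=false case.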
<<< Change leader when generating snapshots: Instead of supporting concurrent snapshot generation and writing to the in-memory state, we could require that only non-leaders generate snapshots.

Did the KIP ever explicitly state that leadership would or would not change? This is mentioned in KIP-631; maybe this belongs in that KIP rather than here?

Ron

> On Jul 26, 2020, at 8:10 PM, Jose Garcia Sancio <jsan...@confluent.io> wrote:
>
> Hi All,
>
> I would like to start a discussion on KIP-630:
> https://cwiki.apache.org/confluence/x/exV4CQ
>
> This proposal specifies extensions to KIP-595 to support generating
> snapshots for the replicated log. Please let me know if you have any
> comments and suggestions.
>
> Thanks!
> --
> -Jose