Hi Jose. Thanks for the KIP. Here are some questions and some nit corrections.
<<< In KIP-500 the Kafka Controller, which is the quorum leader from KIP-595, will materialize the entries in the metadata log into memory.

Technically I think the quorum leader is referred to as the Active Controller in KIP-500. Maybe replace "Kafka Controller" with "Active Controller"? I think the term "Kafka Controller" is fine as used throughout the rest of the KIP to refer to the entire thing, but when referring specifically to the leader I think "Active Controller" is the term that is defined in KIP-500.

<<< Each broker in KIP-500, which will be a replica of the metadata log, will materialize the entries in the log into a metadata cache

This wording confused me because I assumed that "replica" was a formal term and only (non-Active) Controllers are formally "replicas" of the metadata log -- Kafka brokers would be clients that read the log and then use the data for their own purpose as opposed to formally being replicas with this understanding of the term "replica". Is that correct, and if so, maybe replace "replica" with "client"?

<<< The type of in-memory state machines what we plan to implement
>>> The type of in-memory state machines that we plan to implement

nit

<<< doesn't map very well to an key and offset based clean up policy.
>>> doesn't map very well to a key and offset based clean up policy.

nit

<<< When starting a broker either because it is a new broker, a broker was upgraded or a failed broker is restarting. Loading the state represented by the __cluster_metadata topic partition is required before the broker is available
>>> When starting a broker either because it is a new broker or it is
>>> restarting, loading the state represented by the __cluster_metadata topic
>>> partition is required before the broker is available.

Reword for simplicity and clarity?

<<< With snapshot based of the in-memory state Kafka can be much more aggressive
>>> By taking and transmitting a snapshot of the in-memory state as described
>>> below Kafka can be much more aggressive

Tough to refer to the concept of a snapshot here without having described what it is, so refer to "as described below" to help orient the reader?

<<< In the future this mechanism will also be useful for high-throughput topic partitions like the Group Coordinator and Transaction Coordinator.
>>> In the future this mechanism may also be useful for high-throughput topic
>>> partitions like the Group Coordinator and Transaction Coordinator.

Tough to say "will" when that is an assumption that would depend on a future KIP?

<<< If events/deltas are not allowed in the replicated log for the __cluster_metadata topic partition then the ... Kafka Controller will need to replicate 3.81 MB to each broker in the cluster (10) or 38.14 MB.

It might be good to append a sentence that explicitly states how much data is replicated for the delta/event -- right now it is implied to be very small, but that's kind of like leaving the punch line to a joke implied :-)

<<< Follower and observer replicas fetch the snapshots from the leader they attempt to fetch an offset from the leader and the leader doesn’t have that offset in the log
>>> Follower and observer replicas fetch a snapshot from the leader when they
>>> attempt to fetch an offset from the leader and the leader doesn’t have that
>>> offset in the log

nit
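To make sure I have the right mental model for that last sentence (and for the fetch request handling described further down), here is a rough sketch of how I am picturing the leader-side check. This is just my own illustration -- the types, fields and method names below are made up and simplified, not anything defined in the KIP:

import java.util.OptionalLong;

public class LeaderFetchSketch {

    // Hypothetical, simplified stand-ins for the real schemas.
    record FetchRequest(long fetchOffset) {}
    record SnapshotOffsetAndEpoch(long offset, int epoch) {}
    record FetchResponse(OptionalLong snapshotOffset, int snapshotEpoch, byte[] records) {}

    private final long logBeginOffset;                   // LBO
    private final SnapshotOffsetAndEpoch latestSnapshot; // most recent snapshot on the leader

    LeaderFetchSketch(long logBeginOffset, SnapshotOffsetAndEpoch latestSnapshot) {
        this.logBeginOffset = logBeginOffset;
        this.latestSnapshot = latestSnapshot;
    }

    FetchResponse handleFetch(FetchRequest request) {
        if (request.fetchOffset() < logBeginOffset) {
            // The requested offset has already been truncated away, so instead
            // of records the leader tells the follower/observer which snapshot
            // to fetch.
            return new FetchResponse(
                OptionalLong.of(latestSnapshot.offset()),
                latestSnapshot.epoch(),
                new byte[0]);
        }
        // Otherwise the normal fetch path serves records from the log.
        return new FetchResponse(OptionalLong.empty(), -1, readRecordsFrom(request.fetchOffset()));
    }

    private byte[] readRecordsFrom(long offset) {
        return new byte[0]; // placeholder for the existing read path
    }
}

If that is roughly what is intended, it also makes it clearer to me how the latest snapshot and the LBO relate, which ties into my LBO questions further down.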
<<< Generating and loading the snapshot will be delegated to the Kafka
<<< Controller.
<<< The Kafka Controller will notify the Kafka Raft client when it has
<<< generated a snapshot and up to which offset is included in the snapshot.
<<< The Kafka Raft client will notify the Kafka Controller when a new snapshot
<<< has been fetched from the leader.

This paragraph confuses me. What is the "Kafka Raft client" -- is this the broker? Or is it some other subsystem (or all other subsystems aside from log replication) within the Controller? Has this been defined somewhere? If so it would be good to refer to that definition. (Actually, now that I've read further down, I think you refer to this as "Kafka Raft" later in the KIP; a reference to these later sections, or naming it "Kafka Raft Client" later on, would have helped me avoid confusion -- I searched the doc for "raft client" rather than "kafka raft", so I missed this when I searched.)

<<< The Kafka Controller will notify the Kafka Raft client when it has finished generating a new snapshots.

Same comment about "Kafka Raft client".

<<< It is safe for the log to truncate a prefix of the log up to the latest snapshot.

"log to truncate a prefix of the log" -- That first mention of "log" needs to be something else, I assume -- LogManager maybe?

<<< In the example above, if the Kafka topic partition leader receives a fetch request with an offset and epoch greater than or equal to the log begin offset (x, a)
>>> In the example above, offset=x, epoch=a does not appear in the diagram
>>> because it is before the log begin offset (x+1, b). If the Kafka topic
>>> partition leader receives a fetch request with an offset and epoch greater
>>> than or equal to the LBO

Maybe add an explicit comment that offset=x, epoch=a does not appear in the diagram because it is before the LBO of (x+1, b)? Also need to fix the LBO reference (currently incorrectly stated as (x, a)).

<<< LBO can be increase/set to an offset X if the following is true:
<<< 2. One of the following is true:

Do the pair of conditions in (2) only apply to the leader/Active Controller?

<<< The broker will delete any snapshot with a latest offset and epoch (SnapshotOffsetAndEpoch) less than the LBO - 1 (log begin offset).

Is it also a condition that there must be a snapshot in existence "greater than" (in terms of offset, epoch) the one to be deleted?

<<< There are two cases when the Kafka Controller needs to load the snapshot:
<<< 1. When the broker is booting.
>>> 1. When the Controller is booting. (Or "When it is booting")

<<< 2. When the follower and observer replicas finishes fetches a new snapshot from the leader
>>> 2. When a follower or observer replica finishes fetching a new snapshot from the leader

grammar

<<< The replica’s handling of fetch request will be extended such that if FetchOffset is less than LogBeginOffset then the leader will respond with
>>> The leader's handling of fetch request will be extended such that if
>>> FetchOffset is less than LogBeginOffset then it will respond with

clarifying corrections

<<< The replica's handling of fetch response will be extended such that if SnapshotOffsetAndEpoch is set
<<< then the follower will truncate its entire log and start fetching from both the log and the snapshot.

Does it logically truncate or physically truncate, and does it make a difference which one it is? Logical deletion is "softer" than a hard, physical delete and may be less risky if there were to be a catastrophe (i.e. we never want to lose any data, but if we ever get into that situation then might it be best to lose less data?)

<<< If Continue is true, the follower is expected to send another FetchSnapshotRequest with a Position set to the response's Position plus the response's len(Bytes).

Is there a cleanup mechanism necessary for the case where a partial snapshot is downloaded but the download never completes -- perhaps if the leader dies and doesn't come back for a long time? What would a follower do in that case, and how does cleanup work?
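And to double check my understanding of the follower side of this exchange, the flow I have in mind is sketched below. Again, the types and names are made up for illustration (e.g. the Continue field is renamed shouldContinue because "continue" is a Java keyword), so please treat it as my reading of the KIP rather than anything the KIP actually specifies:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class FollowerSnapshotFetchSketch {

    // Hypothetical, simplified stand-ins for the real schemas.
    record SnapshotOffsetAndEpoch(long offset, int epoch) {}
    record FetchSnapshotResponse(long position, byte[] bytes, boolean shouldContinue) {}

    interface Leader {
        FetchSnapshotResponse fetchSnapshot(SnapshotOffsetAndEpoch snapshotId, long position);
    }

    interface ReplicatedLog {
        // Whole-log truncation, as I read the "truncate its entire log" sentence.
        void truncateFullyAndStartAt(long newStartOffset);
    }

    byte[] fetchSnapshotAfterTruncation(Leader leader, ReplicatedLog log,
                                        SnapshotOffsetAndEpoch snapshotId) throws IOException {
        // 1. Truncate the entire local log (this is where my logical-vs-physical
        //    question applies); I am guessing at the exact new start offset here.
        log.truncateFullyAndStartAt(snapshotId.offset() + 1);

        // 2. Fetch the snapshot in chunks, advancing Position until Continue is false.
        ByteArrayOutputStream snapshot = new ByteArrayOutputStream();
        long position = 0;
        while (true) {
            FetchSnapshotResponse response = leader.fetchSnapshot(snapshotId, position);
            snapshot.write(response.bytes());
            if (!response.shouldContinue()) {
                break;
            }
            // Next request starts where this chunk ended:
            // the response's Position plus the response's len(Bytes).
            position = response.position() + response.bytes().length;
        }
        return snapshot.toByteArray();
    }
}

If that loop is the right mental model, then my cleanup question above is really about what happens to the partially accumulated snapshot bytes when the loop never reaches the Continue=false case.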
<<< Change leader when generating snapshots: Instead of supporting concurrent snapshot generation and writing to the in-memory state, we could require that only non-leaders generate snapshots.

Did the KIP ever explicitly state that leadership would or would not change? This is mentioned in KIP-631; maybe this belongs in that KIP rather than here?

Ron

> On Jul 26, 2020, at 8:10 PM, Jose Garcia Sancio <jsan...@confluent.io> wrote:
>
> Hi All,
>
> I would like to start a discussion on KIP-630:
> https://cwiki.apache.org/confluence/x/exV4CQ
>
> This proposal specifies extensions to KIP-595 to support generating
> snapshots for the replicated log. Please let me know if you have any
> comments and suggestions.
>
> Thanks!
> --
> -Jose