Hi, Jose,

Thanks for the KIP. A few comments below.

10. I agree with Jason that it's useful to document the motivation a bit
more clearly. Regarding semantics/performance, one benefit of snapshotting
is that it allows changes to be encoded incrementally instead of as the
full post image. For example, in KIP-631, each partition has multiple
fields like assigned replicas, leader, epoch, isr, etc. If only isr
changes, the snapshotting approach allows the change to be represented
with just the new value of isr. Compaction requires all existing fields to
be included in order to represent just an isr change. This is possible
because with snapshotting we can customize the combining logic. As for the
performance benefit, I guess in theory snapshotting allows the snapshot to
be updated in place incrementally without having to read the full state in
the snapshot. BTW, during compaction, we only read the cleaned data once,
not 3 times.
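
To make the incremental-encoding point concrete, here is a rough sketch of
what I mean; the class and field names are mine and do not match KIP-631's
actual record schemas. With snapshotting the state machine can merge an
isr-only delta into its existing state, while a compacted topic needs the
latest record per key to carry the full post image.

import java.util.List;

// Hypothetical in-memory state for one partition; names are illustrative
// only and are not KIP-631's actual schemas.
final class PartitionState {
    List<Integer> assignedReplicas;
    int leader;
    int leaderEpoch;
    List<Integer> isr;

    // Snapshotting: the state machine owns the combining logic, so a
    // change record only needs to carry the field that changed.
    void applyIsrChange(List<Integer> newIsr) {
        this.isr = newIsr; // leader, epoch and replicas are left untouched
    }

    // Compaction: the latest record for a key must be self-contained, so
    // even an isr-only change has to rewrite the full post image.
    void applyFullPostImage(List<Integer> replicas, int newLeader,
                            int newEpoch, List<Integer> newIsr) {
        this.assignedReplicas = replicas;
        this.leader = newLeader;
        this.leaderEpoch = newEpoch;
        this.isr = newIsr;
    }
}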

11. The KIP mentions topic id. Currently there is no topic id. Does this
KIP depend on KIP-516?

12. Is there a need to keep more than 1 snapshot? It seems we always expose
the latest snapshot to clients.

13. "During leader election, followers with incomplete or missing snapshot
will send a vote request and response as if they had an empty log." Hmm, a
follower may not have a snapshot created, but that doesn't imply its log is
empty.

14. "LBO is max.replication.lag.ms old." Not sure that I follow. How do we
compare an offset to a time?
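
The only reading I could come up with is something like the sketch below,
where what actually gets compared against max.replication.lag.ms is the
age of the record (or snapshot) at the LBO rather than the offset itself;
the method and parameter names are made up, so please correct me if the
intent is different.

// Hypothetical reading of "LBO is max.replication.lag.ms old": compare the
// timestamp of the record at the current LBO against the configured lag.
// How that timestamp would be obtained (time index, snapshot creation
// time, ...) is exactly the part that is unclear to me.
static boolean lboIsOldEnough(long lboRecordTimestampMs,
                              long maxReplicationLagMs,
                              long nowMs) {
    return nowMs - lboRecordTimestampMs >= maxReplicationLagMs;
}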

15. "Followers and observers will increase their log begin offset to the
value sent on the fetch response as long as the local state machine has
generated a snapshot that includes to the log begin offset minus one." Does
the observer store the log? I thought it only needed to maintain a
snapshot. If so, the observer doesn't need to maintain LBO.

16. "There are two cases when the Kafka Controller needs to load the
snapshot: When it is booting. When the follower and observer replicas
finishes fetching a new snapshot from the leader." For faster failover, it
seems it's useful for a non-controller voter to maintain the in-memory
metadata state. In order to do that, it seems that every voter needs to
load the snapshot on booting?

17. "There is an invariant that every log must have at least one snapshot
where the offset of the snapshot is between LogBeginOffset - 1 and
High-Watermark. If this is not the case then the replica should assume that
the log is empty." Does that invariant hold when there is no snapshot
initially?
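
Just to make sure I am reading the invariant correctly, this is roughly
the check I have in mind (the names are mine, not the KIP's). With no
snapshot at all it always falls through to "assume the log is empty",
which is the bootstrap case I am asking about.

// Rough sketch of the quoted invariant: there must be at least one
// snapshot whose offset lies in [LogBeginOffset - 1, High-Watermark];
// otherwise the replica assumes the log is empty. Names are illustrative.
static boolean shouldAssumeEmptyLog(long[] snapshotOffsets,
                                    long logBeginOffset,
                                    long highWatermark) {
    for (long snapshotOffset : snapshotOffsets) {
        if (snapshotOffset >= logBeginOffset - 1 && snapshotOffset <= highWatermark) {
            return false; // a covering snapshot exists, the log is usable
        }
    }
    return true; // no covering snapshot, so assume the log is empty
}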

18. "The __cluster_metadata topic will have an cleanup.policy value of
snapshot" Is there a need to make this configurable if it's read-only?

19. OFFSET_OUT_OF_RANGE in FetchSnapshotResponse: It seems that
POSITION_OUT_OF_RANGE is more appropriate?
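
To illustrate the distinction: the byte position a fetcher asks for points
into a snapshot file rather than into the log, so validating it looks more
like the sketch below. The enum is a stand-in of mine, not Kafka's real
Errors class, and POSITION_OUT_OF_RANGE is just the name being suggested.

// Illustrative only: distinguishing a bad byte position in the snapshot
// file from a bad log offset.
enum SnapshotFetchError { NONE, OFFSET_OUT_OF_RANGE, POSITION_OUT_OF_RANGE }

static SnapshotFetchError validateFetchPosition(long requestedPosition,
                                                long snapshotSizeInBytes) {
    if (requestedPosition < 0 || requestedPosition >= snapshotSizeInBytes) {
        // The fetcher asked for a byte position outside the snapshot file,
        // which is a position problem, not a log offset problem.
        return SnapshotFetchError.POSITION_OUT_OF_RANGE;
    }
    return SnapshotFetchError.NONE;
}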

Thanks,

Jun

On Wed, Aug 5, 2020 at 12:13 PM Jose Garcia Sancio <jsan...@confluent.io>
wrote:

> Once again, thanks for the feedback Jason,
>
> My changes to the KIP are here:
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=18&selectedPageVersions=17
>
> And see my comments below...
>
> On Mon, Aug 3, 2020 at 1:57 PM Jason Gustafson <ja...@confluent.io> wrote:
> >
> > Hi Jose,
> >
> > Thanks for the proposal. I think there are three main motivations for
> > snapshotting over the existing compaction semantics.
> >
> > First we are arguing that compaction is a poor semantic fit for how we
> want
> > to model the metadata in the cluster. We are trying to view the changes
> in
> > the cluster as a stream of events, not necessarily as a stream of
> key/value
> > updates. The reason this is useful is that a single event may correspond
> to
> > a set of key/value updates. We don't need to delete each partition
> > individually for example if we are deleting the full topic. Outside of
> > deletion, however, the benefits of this approach are less obvious. I am
> > wondering if there are other cases where the event-based approach has
> some
> > benefit?
> >
>
> Yes. Another example of this is what KIP-631 calls FenceBroker. In the
> current implementation of the Kafka Controller and the implementation
> proposed in
> KIP-631, whenever a broker is fenced the controller removes the broker
> from the ISR and performs leader election if necessary. The impact of
> this operation on replication is documented in section "Amount of Data
> Replicated". I have also updated the KIP to reflect this.
>
> > The second motivation is from the perspective of consistency. Basically
> we
> > don't like the existing solution for the tombstone deletion problem,
> which
> > is just to add a delay before removal. The case we are concerned about
> > requires a replica to fetch up to a specific offset and then stall for a
> > time which is longer than the deletion retention timeout. If this
> happens,
> > then the replica might not see the tombstone, which would lead to an
> > inconsistent state. I think we are already talking about a rare case,
> but I
> > wonder if there are simple ways to tighten it further. For the sake of
> > argument, what if we had the replica start over from the beginning
> whenever
> > there is a replication delay which is longer than tombstone retention
> time?
> > Just want to be sure we're not missing any simple/pragmatic solutions
> > here...
> >
>
> We explore the changes needed to log compaction and the fetch protocol
> such that it results in a consistent replicated log in the rejected
> sections. I changed the KIP to also mention it in the motivation
> section by adding a section called "Consistent Log and Tombstones"
>
> > Finally, I think we are arguing that compaction gives a poor performance
> > tradeoff when the state is already in memory. It requires us to read and
> > replay all of the changes even though we already know the end result. One
> > way to think about it is that compaction works O(the rate of changes)
> while
> > snapshotting is O(the size of data). Contrarily, the nice thing about
> > compaction is that it works irrespective of the size of the data, which
> > makes it a better fit for user partitions. I feel like this might be an
> > argument we can make empirically or at least with back-of-the-napkin
> > calculations. If we assume a fixed size of data and a certain rate of
> > change, then what are the respective costs of snapshotting vs
> compaction? I
> > think compaction fares worse as the rate of change increases. In the case
> > of __consumer_offsets, which sometimes has to support a very high rate of
> > offset commits, I think snapshotting would be a great tradeoff to reduce
> > load time on coordinator failover. The rate of change for metadata on the
> > other hand might not be as high, though it can be very bursty.
> >
>
> This is a very good observation. If you assume that the number of keys
> doesn't change but that we have frequent updates to its values then I
> think that after log compaction the size of the compacted section of
> the log is O(size of the data) + O(size of the tombstones). And as you
> point out the size of the snapshot is also O(size of the data). I
> think this is a reasonable assumption for topics like
> __cluster_metadata and __consumer_offsets.
>
> The difference is the number of reads required. With in-memory
> snapshot we only need to read the log once. With log compaction we
> need to read the log 3 times: 1. to update the in-memory state, 2.
> generate the map of key to offset and 3. compact the log using the map
> of keys to offsets. I have updated the KIP and go into a lot more
> detail in the section "Loading State and Frequency of Compaction".
>
> --
> -Jose
>
