Thanks for your feedback, Jason. I'll have a more detailed reply and an update to the KIP by EOD today.
On Mon, Aug 3, 2020 at 1:57 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> Hi Jose,
>
> Thanks for the proposal. I think there are three main motivations for snapshotting over the existing compaction semantics.
>
> First we are arguing that compaction is a poor semantic fit for how we want to model the metadata in the cluster. We are trying to view the changes in the cluster as a stream of events, not necessarily as a stream of key/value updates. The reason this is useful is that a single event may correspond to a set of key/value updates. We don't need to delete each partition individually for example if we are deleting the full topic. Outside of deletion, however, the benefits of this approach are less obvious. I am wondering if there are other cases where the event-based approach has some benefit?
>
> The second motivation is from the perspective of consistency. Basically we don't like the existing solution for the tombstone deletion problem, which is just to add a delay before removal. The case we are concerned about requires a replica to fetch up to a specific offset and then stall for a time which is longer than the deletion retention timeout. If this happens, then the replica might not see the tombstone, which would lead to an inconsistent state. I think we are already talking about a rare case, but I wonder if there are simple ways to tighten it further. For the sake of argument, what if we had the replica start over from the beginning whenever there is a replication delay which is longer than tombstone retention time? Just want to be sure we're not missing any simple/pragmatic solutions here...
>
> Finally, I think we are arguing that compaction gives a poor performance tradeoff when the state is already in memory. It requires us to read and replay all of the changes even though we already know the end result. One way to think about it is that compaction works O(the rate of changes) while snapshotting is O(the size of data). Contrarily, the nice thing about compaction is that it works irrespective of the size of the data, which makes it a better fit for user partitions. I feel like this might be an argument we can make empirically or at least with back-of-the-napkin calculations. If we assume a fixed size of data and a certain rate of change, then what are the respective costs of snapshotting vs compaction? I think compaction fares worse as the rate of change increases. In the case of __consumer_offsets, which sometimes has to support a very high rate of offset commits, I think snapshotting would be a great tradeoff to reduce load time on coordinator failover. The rate of change for metadata on the other hand might not be as high, though it can be very bursty.
>
> Thanks,
> Jason
>
>
> On Wed, Jul 29, 2020 at 2:03 PM Jose Garcia Sancio <jsan...@confluent.io> wrote:
> >
> > Thanks Ron for the additional comments and suggestions.
> >
> > Here are the changes to the KIP:
> > https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=17&selectedPageVersions=15
> >
> > On Wed, Jul 29, 2020 at 8:44 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > >
> > > Thanks, Jose. It's looking good.
> > > Here is one minor correction:
> > >
> > > <<< If the Kafka topic partition leader receives a fetch request with an offset and epoch greater than or equal to the LBO (x + 1, a)
> > > >>> If the Kafka topic partition leader receives a fetch request with an offset and epoch greater than or equal to the LBO (x + 1, b)
> >
> > Done.
> >
> > > Here is one more question. Is there an ability to evolve the snapshot format over time, and if so, how is that managed for upgrades? It would be both Controllers and Brokers that would depend on the format, correct? Those could be the same thing if the controller was running inside the broker JVM, but that is an option rather than a requirement, I think. Might the Controller upgrade have to be coordinated with the broker upgrade in the separate-JVM case? Perhaps a section discussing this would be appropriate?
> >
> > The content sent through the FetchSnapshot RPC is expected to be compatible with future changes. In KIP-631 the Kafka Controller is going to use the existing Kafka Message and versioning scheme. Specifically, see the section "Record Format Versions". I added some wording around this.
> >
> > Thanks!
> > -Jose

--
-Jose
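As a purely illustrative aside on the versioning point above: the sketch below is not the actual Kafka Message serialization or the FetchSnapshot schema; the class, field names, and layout are hypothetical. It only shows the general mechanism that makes a version-tagged record format safe to evolve, which is the compatibility property Ron asked about.

import java.nio.ByteBuffer;

// Simplified, hypothetical illustration of a version-tagged record format.
// Version 0 of this made-up record carried only an id; version 1 also carries
// an epoch. Because every record is prefixed with the version it was written
// with, and new fields are appended after the old ones, a reader built against
// version 0 can still consume records written by a version 1 writer.
public final class VersionedRecordSketch {

    // Writer side: serialize with the newest version this code knows about.
    static ByteBuffer writeV1(int id, int epoch) {
        ByteBuffer buf = ByteBuffer.allocate(Short.BYTES + Integer.BYTES + Integer.BYTES);
        buf.putShort((short) 1); // record version comes first
        buf.putInt(id);          // field present since version 0
        buf.putInt(epoch);       // field added in version 1
        buf.flip();
        return buf;
    }

    // Reader side: a version-0-era reader reads the fields it knows about and
    // leaves the trailing bytes added by newer versions untouched.
    static int readIdV0Reader(ByteBuffer buf) {
        short version = buf.getShort(); // tells the reader how the record was written
        int id = buf.getInt();
        System.out.println("read id " + id + " from a version " + version + " record");
        return id;
    }

    public static void main(String[] args) {
        readIdV0Reader(writeV1(42, 7)); // old reader, new record: still works
    }
}

Kafka's real scheme (per-API version ranges, flexible versions, tagged fields) is richer than this, but the upgrade-safety Jose refers to comes from the same idea: the bytes always say which version produced them.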
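And to make Jason's back-of-the-napkin suggestion concrete, here is a rough sketch with assumed numbers. The state size, record size, and change rates are invented for illustration and do not come from the KIP; the point is only the shape of the comparison: a log-cleaner pass does work proportional to the change rate, while writing a snapshot from in-memory state does work proportional to the state size.

// Illustrative back-of-the-napkin numbers for the compaction-vs-snapshot
// tradeoff discussed above. All constants are assumptions chosen for
// illustration, not figures from the KIP.
public class CompactionVsSnapshotSketch {

    public static void main(String[] args) {
        long stateBytes = 100L << 20; // assume ~100 MB of live metadata state
        long recordBytes = 256;       // assume ~256 B per metadata record
        long intervalSec = 3_600;     // compare one hour of maintenance work

        for (long changesPerSec : new long[]{10, 1_000, 100_000}) {
            // Bytes of new records accumulated over the interval.
            long dirtyBytes = changesPerSec * recordBytes * intervalSec;

            // Log-cleaner pass: read the old compacted head plus the dirty tail,
            // then write a new head of roughly the live state size. This work
            // grows with the change rate.
            long compactionIo = (stateBytes + dirtyBytes) + stateBytes;

            // Snapshot pass: the state is already in memory, so we only write one
            // snapshot of roughly the live state size, independent of the change rate.
            long snapshotIo = stateBytes;

            System.out.printf("%,9d changes/s -> compaction ~%,6d MB/h, snapshot ~%,6d MB/h%n",
                changesPerSec, compactionIo >> 20, snapshotIo >> 20);
        }
    }
}

Under these assumptions the two approaches cost about the same at low change rates, but the cleaner's hourly work grows linearly with the change rate while the snapshot cost stays pinned at roughly the state size, which matches the O(rate of changes) versus O(size of data) framing above.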