Hi Jose,

Thanks for the proposal. I think there are three main motivations for snapshotting over the existing compaction semantics.
First, we are arguing that compaction is a poor semantic fit for how we want to model the metadata in the cluster. We are trying to view the changes in the cluster as a stream of events, not necessarily as a stream of key/value updates. The reason this is useful is that a single event may correspond to a set of key/value updates. For example, we don't need to delete each partition individually if we are deleting the full topic. Outside of deletion, however, the benefits of this approach are less obvious. I am wondering if there are other cases where the event-based approach has some benefit?

The second motivation is from the perspective of consistency. Basically, we don't like the existing solution to the tombstone deletion problem, which is just to add a delay before removal. The case we are concerned about requires a replica to fetch up to a specific offset and then stall for longer than the deletion retention timeout. If this happens, the replica might not see the tombstone, which would lead to an inconsistent state. I think we are already talking about a rare case, but I wonder if there are simple ways to tighten it further. For the sake of argument, what if we had the replica start over from the beginning whenever there is a replication delay longer than the tombstone retention time? Just want to be sure we're not missing any simple/pragmatic solutions here...

Finally, I think we are arguing that compaction gives a poor performance tradeoff when the state is already in memory. It requires us to read and replay all of the changes even though we already know the end result. One way to think about it is that compaction costs O(rate of change) while snapshotting costs O(size of the data). Conversely, the nice thing about compaction is that it works irrespective of the size of the data, which makes it a better fit for user partitions. I feel like this is an argument we can make empirically, or at least with back-of-the-napkin calculations. If we assume a fixed size of data and a certain rate of change, what are the respective costs of snapshotting vs. compaction? I think compaction fares worse as the rate of change increases. In the case of __consumer_offsets, which sometimes has to support a very high rate of offset commits, I think snapshotting would be a great tradeoff to reduce load time on coordinator failover. The rate of change for metadata, on the other hand, might not be as high, though it can be very bursty.
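To make that concrete, here is a rough sketch of the kind of back-of-the-napkin comparison I have in mind. All of the numbers below (state size, record size, change rate, interval) are made up for illustration and are not measurements from any cluster:

    // Hypothetical numbers, purely for illustration of the scaling argument.
    public class CompactionVsSnapshotNapkin {
        public static void main(String[] args) {
            long stateSizeBytes  = 100L * 1024 * 1024; // assume ~100 MiB of materialized metadata state
            long recordSizeBytes = 100;                // assume ~100 bytes per change record
            long changesPerSec   = 10_000;             // assume 10k metadata changes per second
            long intervalSec     = 600;                // clean/snapshot roughly every 10 minutes

            // Compaction has to read back and rewrite the changes accumulated over the
            // interval, so its cost scales with the rate of change.
            long compactionBytes = changesPerSec * recordSizeBytes * intervalSec;

            // Snapshotting just writes out the state we already have in memory,
            // so its cost scales with the size of the data.
            long snapshotBytes = stateSizeBytes;

            System.out.printf("compaction: ~%d MiB per interval, snapshot: ~%d MiB per interval%n",
                    compactionBytes >> 20, snapshotBytes >> 20);
        }
    }

With these made-up numbers compaction churns through roughly 5-6x more bytes per interval than a snapshot would, and the gap widens as the change rate grows; flip the assumptions (large state, low rate of change) and compaction wins, which is the user-partition case.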
Thanks,
Jason

On Wed, Jul 29, 2020 at 2:03 PM Jose Garcia Sancio <jsan...@confluent.io> wrote:
> Thanks Ron for the additional comments and suggestions.
>
> Here are the changes to the KIP:
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=17&selectedPageVersions=15
>
> On Wed, Jul 29, 2020 at 8:44 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > Thanks, Jose. It's looking good. Here is one minor correction:
> >
> > <<< If the Kafka topic partition leader receives a fetch request with an
> > offset and epoch greater than or equal to the LBO (x + 1, a)
> > >>> If the Kafka topic partition leader receives a fetch request with an
> > offset and epoch greater than or equal to the LBO (x + 1, b)
>
> Done.
>
> > Here is one more question. Is there an ability to evolve the snapshot
> > format over time, and if so, how is that managed for upgrades? It would be
> > both Controllers and Brokers that would depend on the format, correct?
> > Those could be the same thing if the controller was running inside the
> > broker JVM, but that is an option rather than a requirement, I think.
> > Might the Controller upgrade have to be coordinated with the broker upgrade
> > in the separate-JVM case? Perhaps a section discussing this would be
> > appropriate?
>
> The content sent through the FetchSnapshot RPC is expected to be
> compatible with future changes. In KIP-631 the Kafka Controller is
> going to use the existing Kafka Message and versioning scheme.
> Specifically see section "Record Format Versions". I added some
> wording around this.
>
> Thanks!
> -Jose