Thanks for your feedback, Jason. I'll have a more detailed reply and an update to the KIP by EOD today.
On Mon, Aug 3, 2020 at 1:57 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> Hi Jose,
>
> Thanks for the proposal. I think there are three main motivations for snapshotting over the existing compaction semantics.
>
> First we are arguing that compaction is a poor semantic fit for how we want to model the metadata in the cluster. We are trying to view the changes in the cluster as a stream of events, not necessarily as a stream of key/value updates. The reason this is useful is that a single event may correspond to a set of key/value updates. We don't need to delete each partition individually for example if we are deleting the full topic. Outside of deletion, however, the benefits of this approach are less obvious. I am wondering if there are other cases where the event-based approach has some benefit?
>
> The second motivation is from the perspective of consistency. Basically we don't like the existing solution for the tombstone deletion problem, which is just to add a delay before removal. The case we are concerned about requires a replica to fetch up to a specific offset and then stall for a time which is longer than the deletion retention timeout. If this happens, then the replica might not see the tombstone, which would lead to an inconsistent state. I think we are already talking about a rare case, but I wonder if there are simple ways to tighten it further. For the sake of argument, what if we had the replica start over from the beginning whenever there is a replication delay which is longer than tombstone retention time? Just want to be sure we're not missing any simple/pragmatic solutions here...
>
> Finally, I think we are arguing that compaction gives a poor performance tradeoff when the state is already in memory. It requires us to read and replay all of the changes even though we already know the end result. One way to think about it is that compaction works O(the rate of changes) while snapshotting is O(the size of data). Contrarily, the nice thing about compaction is that it works irrespective of the size of the data, which makes it a better fit for user partitions. I feel like this might be an argument we can make empirically or at least with back-of-the-napkin calculations. If we assume a fixed size of data and a certain rate of change, then what are the respective costs of snapshotting vs compaction? I think compaction fares worse as the rate of change increases. In the case of __consumer_offsets, which sometimes has to support a very high rate of offset commits, I think snapshotting would be a great tradeoff to reduce load time on coordinator failover. The rate of change for metadata on the other hand might not be as high, though it can be very bursty.
>
> Thanks,
> Jason
>
>
> On Wed, Jul 29, 2020 at 2:03 PM Jose Garcia Sancio <jsan...@confluent.io> wrote:
> >
> > Thanks Ron for the additional comments and suggestions.
> >
> > Here are the changes to the KIP:
> > https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=17&selectedPageVersions=15
> >
> > On Wed, Jul 29, 2020 at 8:44 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > >
> > > Thanks, Jose. It's looking good.
> > > Here is one minor correction:
> > >
> > > <<< If the Kafka topic partition leader receives a fetch request with an offset and epoch greater than or equal to the LBO (x + 1, a)
> > > >>> If the Kafka topic partition leader receives a fetch request with an offset and epoch greater than or equal to the LBO (x + 1, b)
> >
> > Done.
> >
> > > Here is one more question. Is there an ability to evolve the snapshot format over time, and if so, how is that managed for upgrades? It would be both Controllers and Brokers that would depend on the format, correct? Those could be the same thing if the controller was running inside the broker JVM, but that is an option rather than a requirement, I think. Might the Controller upgrade have to be coordinated with the broker upgrade in the separate-JVM case? Perhaps a section discussing this would be appropriate?
> >
> > The content sent through the FetchSnapshot RPC is expected to be compatible with future changes. In KIP-631 the Kafka Controller is going to use the existing Kafka Message and versioning scheme. Specifically, see the section "Record Format Versions". I added some wording around this.
> >
> > Thanks!
> > -Jose

--
-Jose
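As a purely illustrative aside on the versioning point above: the sketch below is not the actual Kafka Message serialization or the FetchSnapshot schema; the class, field names, and layout are hypothetical. It only shows the general mechanism that makes a version-tagged record format safe to evolve, which is the compatibility property Ron asked about.

import java.nio.ByteBuffer;

// Simplified, hypothetical illustration of a version-tagged record format.
// Version 0 of this made-up record carried only an id; version 1 also carries
// an epoch. Because every record is prefixed with the version it was written
// with, and new fields are appended after the old ones, a reader built against
// version 0 can still consume records written by a version 1 writer.
public final class VersionedRecordSketch {

    // Writer side: serialize with the newest version this code knows about.
    static ByteBuffer writeV1(int id, int epoch) {
        ByteBuffer buf = ByteBuffer.allocate(Short.BYTES + Integer.BYTES + Integer.BYTES);
        buf.putShort((short) 1); // record version comes first
        buf.putInt(id);          // field present since version 0
        buf.putInt(epoch);       // field added in version 1
        buf.flip();
        return buf;
    }

    // Reader side: a version-0-era reader reads the fields it knows about and
    // leaves the trailing bytes added by newer versions untouched.
    static int readIdV0Reader(ByteBuffer buf) {
        short version = buf.getShort(); // tells the reader how the record was written
        int id = buf.getInt();
        System.out.println("read id " + id + " from a version " + version + " record");
        return id;
    }

    public static void main(String[] args) {
        readIdV0Reader(writeV1(42, 7)); // old reader, new record: still works
    }
}

Kafka's real scheme (per-API version ranges, flexible versions, tagged fields) is richer than this, but the upgrade-safety Jose refers to comes from the same idea: the bytes always say which version produced them.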
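And to make Jason's back-of-the-napkin suggestion concrete, here is a rough sketch with assumed numbers. The state size, record size, and change rates are invented for illustration and do not come from the KIP; the point is only the shape of the comparison: a log-cleaner pass does work proportional to the change rate, while writing a snapshot from in-memory state does work proportional to the state size.

// Illustrative back-of-the-napkin numbers for the compaction-vs-snapshot
// tradeoff discussed above. All constants are assumptions chosen for
// illustration, not figures from the KIP.
public class CompactionVsSnapshotSketch {

    public static void main(String[] args) {
        long stateBytes = 100L << 20; // assume ~100 MB of live metadata state
        long recordBytes = 256;       // assume ~256 B per metadata record
        long intervalSec = 3_600;     // compare one hour of maintenance work

        for (long changesPerSec : new long[]{10, 1_000, 100_000}) {
            // Bytes of new records accumulated over the interval.
            long dirtyBytes = changesPerSec * recordBytes * intervalSec;

            // Log-cleaner pass: read the old compacted head plus the dirty tail,
            // then write a new head of roughly the live state size. This work
            // grows with the change rate.
            long compactionIo = (stateBytes + dirtyBytes) + stateBytes;

            // Snapshot pass: the state is already in memory, so we only write one
            // snapshot of roughly the live state size, independent of the change rate.
            long snapshotIo = stateBytes;

            System.out.printf("%,9d changes/s -> compaction ~%,6d MB/h, snapshot ~%,6d MB/h%n",
                changesPerSec, compactionIo >> 20, snapshotIo >> 20);
        }
    }
}

Under these assumptions the two approaches cost about the same at low change rates, but the cleaner's hourly work grows linearly with the change rate while the snapshot cost stays pinned at roughly the state size, which matches the O(rate of changes) versus O(size of data) framing above.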