Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Lucas Brutschy Mon, 02 Jan 2023 08:18:41 -0800

Hi Nick,

I'm just starting to read up on the whole discussion about KIP-892 and
KIP-844. Thanks a lot for your work on this, I do think
`WriteBatchWithIndex` may be the way to go here. I do have some
questions about the latest draft.


 A) If I understand correctly, you propose to put a bound on the
(native) memory consumed by each task. However, I wonder if this is
sufficient if we have temporary imbalances in the cluster. For
example, depending on the timing of rebalances during a cluster
restart, it could happen that a single streams node is assigned a lot
more tasks than expected. With your proposed change, this would mean
that the memory required by this one node could be a multiple of what
is required during normal operation. I wonder if it wouldn't be safer
to put a global bound on the memory use, across all tasks.
 B) Generally, the memory concerns still give me the feeling that this
should not be enabled by default for all users in a minor release.
 C) In section "Transaction Management": the sentence "A similar
analogue will be created to automatically manage `Segment`
transactions.". Maybe this is just me lacking some background, but I
do not understand this, it would be great if you could clarify what
you mean here.
 D) Could you please clarify why IQ has to call newTransaction(), when
it's read-only.

And one last thing not strictly related to your KIP: if there is an
easy way for you to find out why the KIP-844 PoC is 20x slower (e.g.
by providing a flame graph), that would be quite interesting.

Cheers,
Lucas

On Thu, Dec 22, 2022 at 8:30 PM Nick Telford <nick.telf...@gmail.com> wrote:
>
> Hi everyone,
>
> I've updated the KIP with a more detailed design, which reflects the
> implementation I've been working on:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores
>
> This new design should address the outstanding points already made in the
> thread.
>
> Please let me know if there are areas that are unclear or need more
> clarification.
>
> I have a (nearly) working implementation. I'm confident that the remaining
> work (making Segments behave) will not impact the documented design.
>
> Regards,
>
> Nick
>
> On Tue, 6 Dec 2022 at 19:24, Colt McNealy <c...@littlehorse.io> wrote:
>
> > Nick,
> >
> > Thank you for the reply; that makes sense. I was hoping that, since reading
> > uncommitted records from IQ in EOS isn't part of the documented API, maybe
> > you *wouldn't* have to wait for the next major release to make that change;
> > but given that it would be considered a major change, I like your approach
> > the best.
> >
> > Wishing you a speedy recovery and happy coding!
> >
> > Thanks,
> > Colt McNealy
> > *Founder, LittleHorse.io*
> >
> >
> > On Tue, Dec 6, 2022 at 10:30 AM Nick Telford <nick.telf...@gmail.com>
> > wrote:
> >
> > > Hi Colt,
> > >
> > > 10: Yes, I agree it's not ideal. I originally intended to try to keep the
> > > behaviour unchanged as much as possible, otherwise we'd have to wait for
> > a
> > > major version release to land these changes.
> > > 20: Good point, ALOS doesn't need the same level of guarantee, and the
> > > typically longer commit intervals would be problematic when reading only
> > > "committed" records.
> > >
> > > I've been away for 5 days recovering from minor surgery, but I spent a
> > > considerable amount of that time working through ideas for possible
> > > solutions in my head. I think your suggestion of keeping ALOS as-is, but
> > > buffering writes for EOS is the right path forwards, although I have a
> > > solution that both expands on this, and provides for some more formal
> > > guarantees.
> > >
> > > Essentially, adding support to KeyValueStores for "Transactions", with
> > > clearly defined IsolationLevels. Using "Read Committed" when under EOS,
> > and
> > > "Read Uncommitted" under ALOS.
> > >
> > > The nice thing about this approach is that it gives us much more clearly
> > > defined isolation behaviour that can be properly documented to ensure
> > users
> > > know what to expect.
> > >
> > > I'm still working out the kinks in the design, and will update the KIP
> > when
> > > I have something. The main struggle is trying to implement this without
> > > making any major changes to the existing interfaces or breaking existing
> > > implementations, because currently everything expects to operate directly
> > > on a StateStore, and not a Transaction of that store. I think I'm getting
> > > close, although sadly I won't be able to progress much until next week
> > due
> > > to some work commitments.
> > >
> > > Regards,
> > > Nick
> > >
> > > On Thu, 1 Dec 2022 at 00:01, Colt McNealy <c...@littlehorse.io> wrote:
> > >
> > > > Nick,
> > > >
> > > > Thank you for the explanation, and also for the updated KIP. I am quite
> > > > eager for this improvement to be released as it would greatly reduce
> > the
> > > > operational difficulties of EOS streams apps.
> > > >
> > > > Two questions:
> > > >
> > > > 10)
> > > > >When reading records, we will use the
> > > > WriteBatchWithIndex#getFromBatchAndDB
> > > >  and WriteBatchWithIndex#newIteratorWithBase utilities in order to
> > ensure
> > > > that uncommitted writes are available to query.
> > > > Why do extra work to enable the reading of uncommitted writes during
> > IQ?
> > > > Code complexity aside, reading uncommitted writes is, in my opinion, a
> > > > minor flaw in EOS IQ; it would be very nice to have the guarantee that,
> > > > with EOS, IQ only reads committed records. In order to avoid dirty
> > reads,
> > > > one currently must query a standby replica (but this still doesn't
> > fully
> > > > guarantee monotonic reads).
> > > >
> > > > 20) Is it also necessary to enable this optimization on ALOS stores?
> > The
> > > > motivation of KIP-844 was mainly to reduce the need to restore state
> > from
> > > > scratch on unclean EOS shutdowns; with ALOS it was acceptable to accept
> > > > that there may have been uncommitted writes on disk. On a side note, if
> > > you
> > > > enable this type of store on ALOS processors, the community would
> > > > definitely want to enable queries on dirty reads; otherwise users would
> > > > have to wait 30 seconds (default) to see an update.
> > > >
> > > > Thank you for doing this fantastic work!
> > > > Colt McNealy
> > > > *Founder, LittleHorse.io*
> > > >
> > > >
> > > > On Wed, Nov 30, 2022 at 10:44 AM Nick Telford <nick.telf...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I've drastically reduced the scope of this KIP to no longer include
> > the
> > > > > StateStore management of checkpointing. This can be added as a KIP
> > > later
> > > > on
> > > > > to further optimize the consistency and performance of state stores.
> > > > >
> > > > > I've also added a section discussing some of the concerns around
> > > > > concurrency, especially in the presence of Iterators. I'm thinking of
> > > > > wrapping WriteBatchWithIndex with a reference-counting copy-on-write
> > > > > implementation (that only makes a copy if there's an active
> > iterator),
> > > > but
> > > > > I'm open to suggestions.
> > > > >
> > > > > Regards,
> > > > > Nick
> > > > >
> > > > > On Mon, 28 Nov 2022 at 16:36, Nick Telford <nick.telf...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Colt,
> > > > > >
> > > > > > I didn't do any profiling, but the 844 implementation:
> > > > > >
> > > > > >    - Writes uncommitted records to a temporary RocksDB instance
> > > > > >       - Since tombstones need to be flagged, all record values are
> > > > > >       prefixed with a value/tombstone marker. This necessitates a
> > > > memory
> > > > > copy.
> > > > > >    - On-commit, iterates all records in this temporary instance and
> > > > > >    writes them to the main RocksDB store.
> > > > > >    - While iterating, the value/tombstone marker needs to be parsed
> > > and
> > > > > >    the real value extracted. This necessitates another memory copy.
> > > > > >
> > > > > > My guess is that the cost of iterating the temporary RocksDB store
> > is
> > > > the
> > > > > > major factor, with the 2 extra memory copies per-Record
> > contributing
> > > a
> > > > > > significant amount too.
> > > > > >
> > > > > > Regards,
> > > > > > Nick
> > > > > >
> > > > > > On Mon, 28 Nov 2022 at 16:12, Colt McNealy <c...@littlehorse.io>
> > > > wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> Out of curiosity, why does the performance of the store degrade so
> > > > > >> significantly with the 844 implementation? I wouldn't be too
> > > surprised
> > > > > by
> > > > > >> a
> > > > > >> 50-60% drop (caused by each record being written twice), but 96%
> > is
> > > > > >> extreme.
> > > > > >>
> > > > > >> The only thing I can think of which could create such a bottleneck
> > > > would
> > > > > >> be
> > > > > >> that perhaps the 844 implementation deserializes and then
> > > > re-serializes
> > > > > >> the
> > > > > >> store values when copying from the uncommitted to committed store,
> > > > but I
> > > > > >> wasn't able to figure that out when I scanned the PR.
> > > > > >>
> > > > > >> Colt McNealy
> > > > > >> *Founder, LittleHorse.io*
> > > > > >>
> > > > > >>
> > > > > >> On Mon, Nov 28, 2022 at 7:56 AM Nick Telford <
> > > nick.telf...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi everyone,
> > > > > >> >
> > > > > >> > I've updated the KIP to resolve all the points that have been
> > > raised
> > > > > so
> > > > > >> > far, with one exception: the ALOS default commit interval of 5
> > > > minutes
> > > > > >> is
> > > > > >> > likely to cause WriteBatchWithIndex memory to grow too large.
> > > > > >> >
> > > > > >> > There's a couple of different things I can think of to solve
> > this:
> > > > > >> >
> > > > > >> >    - We already have a memory/record limit in the KIP to prevent
> > > OOM
> > > > > >> >    errors. Should we choose a default value for these? My
> > concern
> > > > here
> > > > > >> is
> > > > > >> > that
> > > > > >> >    anything we choose might seem rather arbitrary. We could
> > change
> > > > > >> >    its behaviour such that under ALOS, it only triggers the
> > commit
> > > > of
> > > > > >> the
> > > > > >> >    StateStore, but under EOS, it triggers a commit of the Kafka
> > > > > >> > transaction.
> > > > > >> >    - We could introduce a separate `checkpoint.interval.ms` to
> > > > allow
> > > > > >> ALOS
> > > > > >> >    to commit the StateStores more frequently than the general
> > > > > >> >    commit.interval.ms? My concern here is that the semantics of
> > > > this
> > > > > >> > config
> > > > > >> >    would depend on the processing.mode; under ALOS it would
> > allow
> > > > more
> > > > > >> >    frequently committing stores, whereas under EOS it couldn't.
> > > > > >> >
> > > > > >> > Any better ideas?
> > > > > >> >
> > > > > >> > On Wed, 23 Nov 2022 at 16:25, Nick Telford <
> > > nick.telf...@gmail.com>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Hi Alex,
> > > > > >> > >
> > > > > >> > > Thanks for the feedback.
> > > > > >> > >
> > > > > >> > > I've updated the discussion of OOM issues by describing how
> > > we'll
> > > > > >> handle
> > > > > >> > > it. Here's the new text:
> > > > > >> > >
> > > > > >> > > To mitigate this, we will automatically force a Task commit if
> > > the
> > > > > >> total
> > > > > >> > >> uncommitted records returned by
> > > > > >> > >> StateStore#approximateNumUncommittedEntries()  exceeds a
> > > > threshold,
> > > > > >> > >> configured by max.uncommitted.state.entries.per.task; or the
> > > > total
> > > > > >> > >> memory used for buffering uncommitted records returned by
> > > > > >> > >> StateStore#approximateNumUncommittedBytes() exceeds the
> > > threshold
> > > > > >> > >> configured by max.uncommitted.state.bytes.per.task. This will
> > > > > roughly
> > > > > >> > >> bound the memory required per-Task for buffering uncommitted
> > > > > records,
> > > > > >> > >> irrespective of the commit.interval.ms, and will effectively
> > > > bound
> > > > > >> the
> > > > > >> > >> number of records that will need to be restored in the event
> > > of a
> > > > > >> > failure.
> > > > > >> > >>
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > These limits will be checked in StreamTask#process and a
> > > premature
> > > > > >> commit
> > > > > >> > >> will be requested via Task#requestCommit().
> > > > > >> > >>
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > Note that these new methods provide default implementations
> > that
> > > > > >> ensure
> > > > > >> > >> existing custom stores and non-transactional stores (e.g.
> > > > > >> > >> InMemoryKeyValueStore) do not force any early commits.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > I've chosen to have the StateStore expose approximations of
> > its
> > > > > buffer
> > > > > >> > > size/count instead of opaquely requesting a commit in order to
> > > > > >> delegate
> > > > > >> > the
> > > > > >> > > decision making to the Task itself. This enables Tasks to look
> > > at
> > > > > >> *all*
> > > > > >> > of
> > > > > >> > > their StateStores, and determine whether an early commit is
> > > > > necessary.
> > > > > >> > > Notably, it enables pre-Task thresholds, instead of per-Store,
> > > > which
> > > > > >> > > prevents Tasks with many StateStores from using much more
> > memory
> > > > > than
> > > > > >> > Tasks
> > > > > >> > > with one StateStore. This makes sense, since commits are done
> > > > > by-Task,
> > > > > >> > not
> > > > > >> > > by-Store.
> > > > > >> > >
> > > > > >> > > Prizes* for anyone who can come up with a better name for the
> > > new
> > > > > >> config
> > > > > >> > > properties!
> > > > > >> > >
> > > > > >> > > Thanks for pointing out the potential performance issues of
> > > WBWI.
> > > > > From
> > > > > >> > the
> > > > > >> > > benchmarks that user posted[1], it looks like WBWI still
> > > performs
> > > > > >> > > considerably better than individual puts, which is the
> > existing
> > > > > >> design,
> > > > > >> > so
> > > > > >> > > I'd actually expect a performance boost from WBWI, just not as
> > > > great
> > > > > >> as
> > > > > >> > > we'd get from a plain WriteBatch. This does suggest that a
> > good
> > > > > >> > > optimization would be to use a regular WriteBatch for
> > > restoration
> > > > > (in
> > > > > >> > > RocksDBStore#restoreBatch), since we know that those records
> > > will
> > > > > >> never
> > > > > >> > be
> > > > > >> > > queried before they're committed.
> > > > > >> > >
> > > > > >> > > 1:
> > > > > >> >
> > > > >
> > > https://github.com/adamretter/rocksjava-write-methods-benchmark#results
> > > > > >> > >
> > > > > >> > > * Just kidding, no prizes, sadly.
> > > > > >> > >
> > > > > >> > > On Wed, 23 Nov 2022 at 12:28, Alexander Sorokoumov
> > > > > >> > > <asorokou...@confluent.io.invalid> wrote:
> > > > > >> > >
> > > > > >> > >> Hey Nick,
> > > > > >> > >>
> > > > > >> > >> Thank you for the KIP! With such a significant performance
> > > > > >> degradation
> > > > > >> > in
> > > > > >> > >> the secondary store approach, we should definitely consider
> > > > > >> > >> WriteBatchWithIndex. I also like encapsulating checkpointing
> > > > inside
> > > > > >> the
> > > > > >> > >> default state store implementation to improve performance.
> > > > > >> > >>
> > > > > >> > >> +1 to John's comment to keep the current checkpointing as a
> > > > > fallback
> > > > > >> > >> mechanism. We want to keep existing users' workflows intact
> > if
> > > we
> > > > > >> can. A
> > > > > >> > >> non-intrusive way would be to add a separate StateStore
> > method,
> > > > > say,
> > > > > >> > >> StateStore#managesCheckpointing(), that controls whether the
> > > > state
> > > > > >> store
> > > > > >> > >> implementation owns checkpointing.
> > > > > >> > >>
> > > > > >> > >> I think that a solution to the transactional writes should
> > > > address
> > > > > >> the
> > > > > >> > >> OOMEs. One possible way to address that is to wire
> > StateStore's
> > > > > >> commit
> > > > > >> > >> request by adding, say, StateStore#commitNeeded that is
> > checked
> > > > in
> > > > > >> > >> StreamTask#commitNeeded via the corresponding
> > > > > ProcessorStateManager.
> > > > > >> > With
> > > > > >> > >> that change, RocksDBStore will have to track the current
> > > > > transaction
> > > > > >> > size
> > > > > >> > >> and request a commit when the size goes over a (configurable)
> > > > > >> threshold.
> > > > > >> > >>
> > > > > >> > >> AFAIU WriteBatchWithIndex might perform significantly slower
> > > than
> > > > > >> > non-txn
> > > > > >> > >> puts as the batch size grows [1]. We should have a
> > > configuration
> > > > to
> > > > > >> fall
> > > > > >> > >> back to the current behavior (and/or disable txn stores for
> > > ALOS)
> > > > > >> unless
> > > > > >> > >> the benchmarks show negligible overhead for longer commits /
> > > > > >> > large-enough
> > > > > >> > >> batch sizes.
> > > > > >> > >>
> > > > > >> > >> If you prefer to keep the KIP smaller, I would rather cut out
> > > > > >> > >> state-store-managed checkpointing rather than proper OOMe
> > > > handling
> > > > > >> and
> > > > > >> > >> being able to switch to non-txn behavior. The checkpointing
> > is
> > > > not
> > > > > >> > >> necessary to solve the recovery-under-EOS problem. On the
> > other
> > > > > hand,
> > > > > >> > once
> > > > > >> > >> WriteBatchWithIndex is in, it will be much easier to add
> > > > > >> > >> state-store-managed checkpointing.
> > > > > >> > >>
> > > > > >> > >> If you share the current implementation, I am happy to help
> > you
> > > > > >> address
> > > > > >> > >> the
> > > > > >> > >> OOMe and configuration parts as well as review and test the
> > > > patch.
> > > > > >> > >>
> > > > > >> > >> Best,
> > > > > >> > >> Alex
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> 1. https://github.com/facebook/rocksdb/issues/608
> > > > > >> > >>
> > > > > >> > >> On Tue, Nov 22, 2022 at 6:31 PM Nick Telford <
> > > > > nick.telf...@gmail.com
> > > > > >> >
> > > > > >> > >> wrote:
> > > > > >> > >>
> > > > > >> > >> > Hi John,
> > > > > >> > >> >
> > > > > >> > >> > Thanks for the review and feedback!
> > > > > >> > >> >
> > > > > >> > >> > 1. Custom Stores: I've been mulling over this problem
> > myself.
> > > > As
> > > > > it
> > > > > >> > >> stands,
> > > > > >> > >> > custom stores would essentially lose checkpointing with no
> > > > > >> indication
> > > > > >> > >> that
> > > > > >> > >> > they're expected to make changes, besides a line in the
> > > release
> > > > > >> > notes. I
> > > > > >> > >> > agree that the best solution would be to provide a default
> > > that
> > > > > >> > >> checkpoints
> > > > > >> > >> > to a file. The one thing I would change is that the
> > > > checkpointing
> > > > > >> is
> > > > > >> > to
> > > > > >> > >> a
> > > > > >> > >> > store-local file, instead of a per-Task file. This way the
> > > > > >> StateStore
> > > > > >> > >> still
> > > > > >> > >> > technically owns its own checkpointing (via a default
> > > > > >> implementation),
> > > > > >> > >> and
> > > > > >> > >> > the StateManager/Task execution engine doesn't need to know
> > > > > >> anything
> > > > > >> > >> about
> > > > > >> > >> > checkpointing, which greatly simplifies some of the logic.
> > > > > >> > >> >
> > > > > >> > >> > 2. OOME errors: The main reasons why I didn't explore a
> > > > solution
> > > > > to
> > > > > >> > >> this is
> > > > > >> > >> > a) to keep this KIP as simple as possible, and b) because
> > I'm
> > > > not
> > > > > >> > >> exactly
> > > > > >> > >> > how to signal that a Task should commit prematurely. I'm
> > > > > confident
> > > > > >> > it's
> > > > > >> > >> > possible, and I think it's worth adding a section on
> > handling
> > > > > this.
> > > > > >> > >> Besides
> > > > > >> > >> > my proposal to force an early commit once memory usage
> > > reaches
> > > > a
> > > > > >> > >> threshold,
> > > > > >> > >> > is there any other approach that you might suggest for
> > > tackling
> > > > > >> this
> > > > > >> > >> > problem?
> > > > > >> > >> >
> > > > > >> > >> > 3. ALOS: I can add in an explicit paragraph, but my
> > > assumption
> > > > is
> > > > > >> that
> > > > > >> > >> > since transactional behaviour comes at little/no cost, that
> > > it
> > > > > >> should
> > > > > >> > be
> > > > > >> > >> > available by default on all stores, irrespective of the
> > > > > processing
> > > > > >> > mode.
> > > > > >> > >> > While ALOS doesn't use transactions, the Task itself still
> > > > > >> "commits",
> > > > > >> > so
> > > > > >> > >> > the behaviour should be correct under ALOS too. I'm not
> > > > convinced
> > > > > >> that
> > > > > >> > >> it's
> > > > > >> > >> > worth having both transactional/non-transactional stores
> > > > > >> available, as
> > > > > >> > >> it
> > > > > >> > >> > would considerably increase the complexity of the codebase,
> > > for
> > > > > >> very
> > > > > >> > >> little
> > > > > >> > >> > benefit.
> > > > > >> > >> >
> > > > > >> > >> > 4. Method deprecation: Are you referring to
> > > > > >> StateStore#getPosition()?
> > > > > >> > >> As I
> > > > > >> > >> > understand it, Position contains the position of the
> > *source*
> > > > > >> topics,
> > > > > >> > >> > whereas the commit offsets would be the *changelog*
> > offsets.
> > > So
> > > > > >> it's
> > > > > >> > >> still
> > > > > >> > >> > necessary to retain the Position data, as well as the
> > > changelog
> > > > > >> > offsets.
> > > > > >> > >> > What I meant in the KIP is that Position offsets are
> > > currently
> > > > > >> stored
> > > > > >> > >> in a
> > > > > >> > >> > file, and since we can atomically store metadata along with
> > > the
> > > > > >> record
> > > > > >> > >> > batch we commit to RocksDB, we can move our Position
> > offsets
> > > in
> > > > > to
> > > > > >> > this
> > > > > >> > >> > metadata too, and gain the same transactional guarantees
> > that
> > > > we
> > > > > >> will
> > > > > >> > >> for
> > > > > >> > >> > changelog offsets, ensuring that the Position offsets are
> > > > > >> consistent
> > > > > >> > >> with
> > > > > >> > >> > the records that are read from the database.
> > > > > >> > >> >
> > > > > >> > >> > Regards,
> > > > > >> > >> > Nick
> > > > > >> > >> >
> > > > > >> > >> > On Tue, 22 Nov 2022 at 16:25, John Roesler <
> > > > vvcep...@apache.org>
> > > > > >> > wrote:
> > > > > >> > >> >
> > > > > >> > >> > > Thanks for publishing this alternative, Nick!
> > > > > >> > >> > >
> > > > > >> > >> > > The benchmark you mentioned in the KIP-844 discussion
> > seems
> > > > > like
> > > > > >> a
> > > > > >> > >> > > compelling reason to revisit the built-in
> > transactionality
> > > > > >> > mechanism.
> > > > > >> > >> I
> > > > > >> > >> > > also appreciate you analysis, showing that for most use
> > > > cases,
> > > > > >> the
> > > > > >> > >> write
> > > > > >> > >> > > batch approach should be just fine.
> > > > > >> > >> > >
> > > > > >> > >> > > There are a couple of points that would hold me back from
> > > > > >> approving
> > > > > >> > >> this
> > > > > >> > >> > > KIP right now:
> > > > > >> > >> > >
> > > > > >> > >> > > 1. Loss of coverage for custom stores.
> > > > > >> > >> > > The fact that you can plug in a (relatively) simple
> > > > > >> implementation
> > > > > >> > of
> > > > > >> > >> the
> > > > > >> > >> > > XStateStore interfaces and automagically get a
> > distributed
> > > > > >> database
> > > > > >> > >> out
> > > > > >> > >> > of
> > > > > >> > >> > > it is a significant benefit of Kafka Streams. I'd hate to
> > > > lose
> > > > > >> it,
> > > > > >> > so
> > > > > >> > >> it
> > > > > >> > >> > > would be better to spend some time and come up with a way
> > > to
> > > > > >> > preserve
> > > > > >> > >> > that
> > > > > >> > >> > > property. For example, can we provide a default
> > > > implementation
> > > > > of
> > > > > >> > >> > > `commit(..)` that re-implements the existing
> > > checkpoint-file
> > > > > >> > >> approach? Or
> > > > > >> > >> > > perhaps add an `isTransactional()` flag to the state
> > store
> > > > > >> interface
> > > > > >> > >> so
> > > > > >> > >> > > that the runtime can decide whether to continue to manage
> > > > > >> checkpoint
> > > > > >> > >> > files
> > > > > >> > >> > > vs delegating transactionality to the stores?
> > > > > >> > >> > >
> > > > > >> > >> > > 2. Guarding against OOME
> > > > > >> > >> > > I appreciate your analysis, but I don't think it's
> > > sufficient
> > > > > to
> > > > > >> say
> > > > > >> > >> that
> > > > > >> > >> > > we will solve the memory problem later if it becomes
> > > > necessary.
> > > > > >> The
> > > > > >> > >> > > experience leading to that situation would be quite bad:
> > > > > Imagine,
> > > > > >> > you
> > > > > >> > >> > > upgrade to AK 3.next, your tests pass, so you deploy to
> > > > > >> production.
> > > > > >> > >> That
> > > > > >> > >> > > night, you get paged because your app is now crashing
> > with
> > > > > >> OOMEs. As
> > > > > >> > >> with
> > > > > >> > >> > > all OOMEs, you'll have a really hard time finding the
> > root
> > > > > cause,
> > > > > >> > and
> > > > > >> > >> > once
> > > > > >> > >> > > you do, you won't have a clear path to resolve the issue.
> > > You
> > > > > >> could
> > > > > >> > >> only
> > > > > >> > >> > > tune down the commit interval and cache buffer size until
> > > you
> > > > > >> stop
> > > > > >> > >> > getting
> > > > > >> > >> > > crashes.
> > > > > >> > >> > >
> > > > > >> > >> > > FYI, I know of multiple cases where people run EOS with
> > > much
> > > > > >> larger
> > > > > >> > >> > commit
> > > > > >> > >> > > intervals to get better batching than the default, so I
> > > don't
> > > > > >> think
> > > > > >> > >> this
> > > > > >> > >> > > pathological case would be as rare as you suspect.
> > > > > >> > >> > >
> > > > > >> > >> > > Given that we already have the rudiments of an idea of
> > what
> > > > we
> > > > > >> could
> > > > > >> > >> do
> > > > > >> > >> > to
> > > > > >> > >> > > prevent this downside, we should take the time to design
> > a
> > > > > >> solution.
> > > > > >> > >> We
> > > > > >> > >> > owe
> > > > > >> > >> > > it to our users to ensure that awesome new features don't
> > > > come
> > > > > >> with
> > > > > >> > >> > bitter
> > > > > >> > >> > > pills unless we can't avoid it.
> > > > > >> > >> > >
> > > > > >> > >> > > 3. ALOS mode.
> > > > > >> > >> > > On the other hand, I didn't see an indication of how
> > stores
> > > > > will
> > > > > >> be
> > > > > >> > >> > > handled under ALOS (aka non-EOS) mode. Theoretically, the
> > > > > >> > >> > transactionality
> > > > > >> > >> > > of the store and the processing mode are orthogonal. A
> > > > > >> transactional
> > > > > >> > >> > store
> > > > > >> > >> > > would serve ALOS just as well as a non-transactional one
> > > (if
> > > > > not
> > > > > >> > >> better).
> > > > > >> > >> > > Under ALOS, though, the default commit interval is five
> > > > > minutes,
> > > > > >> so
> > > > > >> > >> the
> > > > > >> > >> > > memory issue is far more pressing.
> > > > > >> > >> > >
> > > > > >> > >> > > As I see it, we have several options to resolve this
> > point.
> > > > We
> > > > > >> could
> > > > > >> > >> > > demonstrate that transactional stores work just fine for
> > > ALOS
> > > > > >> and we
> > > > > >> > >> can
> > > > > >> > >> > > therefore just swap over unconditionally. We could also
> > > > disable
> > > > > >> the
> > > > > >> > >> > > transactional mechanism under ALOS so that stores operate
> > > > just
> > > > > >> the
> > > > > >> > >> same
> > > > > >> > >> > as
> > > > > >> > >> > > they do today when run in ALOS mode. Finally, we could do
> > > the
> > > > > >> same
> > > > > >> > as
> > > > > >> > >> in
> > > > > >> > >> > > KIP-844 and make transactional stores opt-in (it'd be
> > > better
> > > > to
> > > > > >> > avoid
> > > > > >> > >> the
> > > > > >> > >> > > extra opt-in mechanism, but it's a good
> > > get-out-of-jail-free
> > > > > >> card).
> > > > > >> > >> > >
> > > > > >> > >> > > 4. (minor point) Deprecation of methods
> > > > > >> > >> > >
> > > > > >> > >> > > You mentioned that the new `commit` method replaces
> > flush,
> > > > > >> > >> > > updateChangelogOffsets, and checkpoint. It seems to me
> > that
> > > > the
> > > > > >> > point
> > > > > >> > >> > about
> > > > > >> > >> > > atomicity and Position also suggests that it replaces the
> > > > > >> Position
> > > > > >> > >> > > callbacks. However, the proposal only deprecates `flush`.
> > > > > Should
> > > > > >> we
> > > > > >> > be
> > > > > >> > >> > > deprecating other methods as well?
> > > > > >> > >> > >
> > > > > >> > >> > > Thanks again for the KIP! It's really nice that you and
> > > Alex
> > > > > will
> > > > > >> > get
> > > > > >> > >> the
> > > > > >> > >> > > chance to collaborate on both directions so that we can
> > get
> > > > the
> > > > > >> best
> > > > > >> > >> > > outcome for Streams and its users.
> > > > > >> > >> > >
> > > > > >> > >> > > -John
> > > > > >> > >> > >
> > > > > >> > >> > >
> > > > > >> > >> > > On 2022/11/21 15:02:15 Nick Telford wrote:
> > > > > >> > >> > > > Hi everyone,
> > > > > >> > >> > > >
> > > > > >> > >> > > > As I mentioned in the discussion thread for KIP-844,
> > I've
> > > > > been
> > > > > >> > >> working
> > > > > >> > >> > on
> > > > > >> > >> > > > an alternative approach to achieving better
> > transactional
> > > > > >> > semantics
> > > > > >> > >> for
> > > > > >> > >> > > > Kafka Streams StateStores.
> > > > > >> > >> > > >
> > > > > >> > >> > > > I've published this separately as KIP-892:
> > Transactional
> > > > > >> Semantics
> > > > > >> > >> for
> > > > > >> > >> > > > StateStores
> > > > > >> > >> > > > <
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> > >>
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores
> > > > > >> > >> > > >,
> > > > > >> > >> > > > so that it can be discussed/reviewed separately from
> > > > KIP-844.
> > > > > >> > >> > > >
> > > > > >> > >> > > > Alex: I'm especially interested in what you think!
> > > > > >> > >> > > >
> > > > > >> > >> > > > I have a nearly complete implementation of the changes
> > > > > >> outlined in
> > > > > >> > >> this
> > > > > >> > >> > > > KIP, please let me know if you'd like me to push them
> > for
> > > > > >> review
> > > > > >> > in
> > > > > >> > >> > > advance
> > > > > >> > >> > > > of a vote.
> > > > > >> > >> > > >
> > > > > >> > >> > > > Regards,
> > > > > >> > >> > > >
> > > > > >> > >> > > > Nick
> > > > > >> > >> > > >
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> > >>
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Reply via email to