Thanks for the responses, Nick.

>BB1 fair enough.

>BB2 - can we update the transactional thread safety part of the KIP with
the response you provided here?
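If it helps, a small sketch along these lines might make that section clearer. To be clear, all class and method names below are invented for illustration (they're not from the KIP or the implementation); it just mirrors the described design: a concurrent read buffer serving uncommitted reads, a separate write buffer staging writes for commit (a RocksDB WriteBatch in the real design; a TreeMap stands in here), and a ReentrantReadWriteLock so IQ snapshot copies and commits don't race.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only -- names are invented, not taken from the KIP.
public class TransactionBufferSketch {
    private final ConcurrentSkipListMap<String, String> readBuffer =
            new ConcurrentSkipListMap<>();
    // Stand-in for the RocksDB WriteBatch "write buffer".
    private final NavigableMap<String, String> writeBuffer = new TreeMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Called only by the single StreamThread that owns the task.
    public void put(String key, String value) {
        readBuffer.put(key, value);   // visible to READ_UNCOMMITTED readers
        writeBuffer.put(key, value);  // staged for commit to the database
    }

    // READ_UNCOMMITTED read path: an uncommitted value wins over the store.
    public String get(String key, Map<String, String> underlyingStore) {
        String uncommitted = readBuffer.get(key);
        return uncommitted != null ? uncommitted : underlyingStore.get(key);
    }

    // IQ iterators must copy the read buffer: its weakly consistent
    // iterators would otherwise observe the buffer being emptied on commit.
    public NavigableMap<String, String> snapshot() {
        lock.readLock().lock();
        try {
            return new TreeMap<>(readBuffer);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Commit flushes the staged writes and empties both buffers; the write
    // lock briefly blocks IQ threads from creating new snapshots.
    public void commit(Map<String, String> underlyingStore) {
        lock.writeLock().lock();
        try {
            underlyingStore.putAll(writeBuffer);
            writeBuffer.clear();
            readBuffer.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

(Deletes/tombstones are omitted here to keep the sketch short.)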

>BB3b - I think that approach will work.  My main concern is providing a
clear picture of the memory used and how it interacts with both the
caches and flushing.
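On that note, it might also help the KIP to spell out the READ_UNCOMMITTED read order explicitly. A rough sketch, under my current understanding of the layering (cache, then transaction buffer, then underlying store); the types here are invented stand-ins, none of these names exist in Kafka Streams:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: three maps stand in for the real StateStore layers.
public class ReadPathSketch {
    final Map<String, String> cache = new HashMap<>();             // LRU cache layer
    final Map<String, String> transactionBuffer = new HashMap<>(); // uncommitted writes
    final Map<String, String> underlyingStore = new HashMap<>();   // committed state

    // READ_UNCOMMITTED (and StreamThread) reads follow the layering order:
    // cache first, then the transaction buffer, then the underlying store.
    String readUncommitted(String key) {
        String v = cache.get(key);
        if (v != null) return v;
        v = transactionBuffer.get(key);
        if (v != null) return v;
        return underlyingStore.get(key);
    }

    // READ_COMMITTED IQ skips both the cache and the transaction buffer,
    // so it only ever observes committed state.
    String readCommitted(String key) {
        return underlyingStore.get(key);
    }
}
```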

On Tue, Apr 21, 2026 at 10:52 AM Nick Telford <[email protected]>
wrote:

> BB3b: Thinking a bit more about the cache vs. transaction buffer - what if
> we make a subtle change to the cache: when enable.transactional.statestores
> = true, we no longer cache writes (put/delete), only reads. Since the
> transaction buffer will already be storing recently written records (at
> least until commit), we don't need to keep them in another cache. We would
> still cache records read *from* the transaction buffer (because we would
> take the read of a recently written record as a signal that it's worth
> caching). The only caveat to this would be that the transaction buffer
> flushes recently written records on-commit, so cache hit-rate of recently
> written records will dip during commits.
>
> What do you think?
>
> On Tue, 21 Apr 2026 at 10:39, Nick Telford <[email protected]> wrote:
>
> > Hi Bill,
> >
> > Thanks for reviewing the updated KIP!
> >
> > BB1: I'm not quite sure I understand what you're saying here. We need to
> > take a copy/snapshot of the read buffer because the read buffer is
> > emptied on-commit, and any Iterator created on a ConcurrentSkipListMap
> > automatically reflects those changes, so would terminate iteration
> > prematurely. Also, any new writes to the buffer (put/delete) would be
> > immediately reflected in the iterator, making its results inconsistent.
> > TL;DR: copying it gives us full snapshot isolation for IQ iterators.
> > FWIW, the implementation already takes an Iterator over the underlying
> > store, as well as the buffer, and merges the two iterators together. But
> > the read buffer's Iterator must be over a copy of the read buffer, to
> > provide a consistent snapshot of the store.
> >
> > BB2: Basically, the RocksDBTransactionBuffer will maintain an extra
> > "write buffer", which uses a RocksDB WriteBatch to stage writes during
> > put/delete. This yields better throughput than iterating the
> > ConcurrentSkipListMap and writing to a WriteBatch during commit, and
> > reduces the time that IQ threads (that need a readLock on the
> > ConcurrentSkipListMap) are blocked from creating iterators during commit.
> >
> > InMemoryTransactionBuffers do not need this extra "write buffer", as
> > there is no more efficient implementation for them to use. This means
> > that, vs.
> > RocksDB, InMemoryStores will actually use less memory buffering records.
> >
> > BB3: This is a good question, and one that I've not really dug too deeply
> > into. I suspect that there's a level of duplication of concerns here, but
> > there are subtle differences: the cache is an LRU cache, which can cache
> > records across multiple commits (if they're frequently read, for
> > example), whereas the transaction buffer always exclusively contains
> > uncommitted writes.
> >
> > In terms of correctness/consistency, I don't think we have a problem. For
> > READ_COMMITTED IQ, we skip both the cache and transaction buffers. For
> > READ_UNCOMMITTED IQ (and StreamThread reads) we read in the order
> > dictated by the StateStore layering, which is: cache first, then
> > transaction buffer, then underlying store. Since writes always update
> > the cache, this is
> > correct. The one wrinkle is that cached writes are only written to
> > downstream store layers (and therefore buffered by the transaction
> > buffer) when the entry is evicted, or the cache is flushed or
> > committed. This
> > essentially double-buffers writes, which is a bit wasteful of memory.
> >
> > I don't really know much about the cache, but as far as I can tell it
> > might eliminate some (de)serialization overhead? Regardless, given
> > that it implements a different behaviour (LRU vs. buffering
> > uncommitted writes),
> > eliminating it might cause some subtle performance regressions for *some*
> > users.
> >
> > Maybe instead we have the *default* cache size change to 0 when
> > enable.transactional.statestores = true and
> > statestore.uncommitted.max.bytes > 0? That would reduce memory usage for
> > users that aren't explicitly relying on the cache.
> >
> > BB4: Ahh yes, good catch! This is showing how old this KIP is :-)
> >
> > Regards,
> > Nick
> >
> > On Mon, 20 Apr 2026 at 21:11, Bill Bejeck <[email protected]> wrote:
> >
> >> Hi Nick,
> >>
> >> Thanks for the update to the KIP!  It's great to see this moving again.
> >> Overall the KIP updates LGTM, but I do have a couple of comments:
> >>
> >> BB1 - The KIP specifies using a ConcurrentSkipListMap for the
> >> transactional buffer, and for IQ READ_UNCOMMITTED queries KS will take
> >> a snapshot, locking out the StreamThread during the copy with a
> >> ReentrantReadWriteLock.
> >> Now, if the IQ queries will only be served from the in-memory
> >> transactional buffer, that's an acceptable trade-off, since even though
> >> the ConcurrentSkipListMap uses weakly consistent iterators, a commit
> >> mid-iteration would mean those records are not seen, as a result of the
> >> buffer being cleared.  Would it be worth exploring serving IQ against
> >> both the store and the buffer, thus removing the copy+locking?  I think
> >> this could be worthwhile (unless it's technically infeasible), as it
> >> stands users may have to trade off performance by adjusting either the
> >> commit interval or the size of uncommitted bytes. Of course, as
> >> specified in the KIP, they have the option of going with READ_COMMITTED.
> >>
> >> BB2 - In the KIP section "Transaction Buffer Thread Safety" there's this
> >> sentence which I find confusing:
> >>
> >> > Reads from the StreamThread and READ_UNCOMMITTED IQ threads are
> >> > serviced by the read buffer. The write buffer is (optionally) used
> >> > to buffer
> >> > writes for commit to the database, if doing so would yield better
> >> > performance than using the read buffer.
> >>
> >> Can you clarify what this means?
> >>
> >> BB3 - How does the `statestore.uncommitted.max.bytes` config work in
> >> conjunction with the `statestore.cache.max.bytes.buffering` config?  If
> >> users elect to go with transactional stores, is the statestore cache
> >> still needed?  Since the statestore cache flushes on commit(), as does
> >> the transactional buffer, what's the workflow between the two?  The
> >> default size of the store cache is 10MB, so if it becomes full and a
> >> commit is not called, would it empty into the transactional buffer?
> >>
> >> BB4 - In discussing the `enable.transactional.statestores` config, it
> >> mentions "exactly-once, exactly-once-v2 or exactly-once-beta", but we
> >> can go with just `exactly-once-v2`; we removed the others in 4.0.
> >>
> >> Thanks!
> >> Bill
> >>
> >>
> >> On Fri, Apr 17, 2026 at 1:13 PM Nick Telford <[email protected]>
> >> wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > We're circling back to KIP-892, with the goal of including it in
> >> > 4.4. To that end, I've updated the KIP with substantial design
> >> > improvements:
> >> >
> >> >
> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-892:+Transactional+Semantics+for+StateStores
> >> >
> >> > Highlights:
> >> >
> >> >    - TransactionBuffers that support both RocksDB and InMemory stores
> >> >    - Support for a custom IsolationLevel for Interactive Queries,
> >> >      including a default (IQv1 and IQv2) and a per-request isolation
> >> >      level (IQv2 only)
> >> >    - New store-level metric for uncommitted-bytes
> >> >
> >> > Let me know what you think!
> >> >
> >> > Regards,
> >> > Nick
> >> >
> >> > On Wed, 17 Apr 2024 at 11:50, Nick Telford <[email protected]>
> >> wrote:
> >> >
> >> > > Hi Walker,
> >> > >
> >> > > Feel free to ask away, either on the mailing list or on the Confluent
> >> > > Community Slack, where I hang out :-)
> >> > >
> >> > > The implementation is *mostly* complete, although it needs some
> >> > polishing.
> >> > > It's worth noting that KIP-1035 is a hard prerequisite for this.
> >> > >
> >> > > Regards,
> >> > > Nick
> >> > >
> >> >
> >>
> >
>
