Hi Bill, I've updated the "State Store Caching" section to explain the main caveat (regarding memory accounting) and noted that we'll revisit this in the future.
I believe all your other points have been addressed. Is there anything else you'd like me to clarify? Regards, Nick On Thu, 30 Apr 2026 at 15:09, Bill Bejeck <[email protected]> wrote: > > > > I'm not sure there's a way to preserve the > > caching of intermediate writes for aggregations without effectively > > "double-buffering" writes. > > > Yes, I beggining to have the same opinion, we'll see as the development > progresses. > > -Bill > > On Thu, Apr 30, 2026 at 8:38 AM Nick Telford <[email protected]> > wrote: > > > Regarding the cache: if we disable write-caching when transactional > stores > > are enabled, that will disable the caching of intermediate values for > > aggregations, causing aggregation processors to run on every write. > > Depending on the application, this could be negligible (which I believe > > will be the case for basically any builtin aggregation), or it could be > > significant (if the aggregation processor does anything particularly > > heavyweight, like I/O). > > > > Is this an acceptable trade-off? I'm not sure there's a way to preserve > the > > caching of intermediate writes for aggregations without effectively > > "double-buffering" writes. > > > > Could we also add the cache's "flush to downstream processors only on > > commit" behaviour to our transaction buffers? My concern with that is > that > > our transaction buffering requires a WriteLock during commit, to prevent > IQ > > threads seeing an inconsistent view of the store. Therefore, store commit > > should not block while awaiting extensive downstream processing. > > > > Regards, > > Nick > > > > On Thu, 30 Apr 2026 at 10:34, Nick Telford <[email protected]> > wrote: > > > > > Thanks Bill! > > > > > > I've updated the KIP as requested: > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores > > > > > > 1. Updated the section on thread safety with specifics about > RocksDBStore > > > and InMemory stores. > > > 2. Added a section on State Store Caching > > > > > > Let me know what you think, especially about the caching section, and > if > > > there's any other changes you'd like to see. > > > > > > Cheers, > > > Nick > > > > > > On Wed, 29 Apr 2026 at 21:57, Bill Bejeck <[email protected]> wrote: > > > > > >> Thanks for the responses Nick. > > >> > > >> >BB1 fair enough. > > >> > > >> >BB2 - can we update the transactional thread safety part of the KIP > > with > > >> the response you provided here? > > >> > > >> >BB3b - I think that approach will work. My main concern is to > provide > > a > > >> clear picture of the memory used and what's the interaction with both > > >> caches and flushing. > > >> > > >> On Tue, Apr 21, 2026 at 10:52 AM Nick Telford <[email protected] > > > > >> wrote: > > >> > > >> > BB3b: Thinking a bit more about the cache vs. transaction buffer - > > what > > >> if > > >> > we make a subtle change to the cache: when > > >> enable.transactional.statestores > > >> > = true, we no longer cache writes (put/delete), only reads. Since > the > > >> > transaction buffer will already be storing recently written records > > (at > > >> > least until commit), we don't need to keep them in another cache. We > > >> would > > >> > still cache records read *from* the transaction buffer (because we > > would > > >> > take the read of a recently written record as a signal that it's > worth > > >> > caching). The only caveat to this would be that the transaction > buffer > > >> > flushes recently written records on-commit, so cache hit-rate of > > >> recently > > >> > written records will dip during commits. > > >> > > > >> > What do you think? > > >> > > > >> > On Tue, 21 Apr 2026 at 10:39, Nick Telford <[email protected]> > > >> wrote: > > >> > > > >> > > Hi Bill, > > >> > > > > >> > > Thanks for reviewing the updated KIP! > > >> > > > > >> > > BB1: I'm not quite sure I understand what you're saying here. We > > need > > >> to > > >> > > take a copy/snapshot of the read buffer because the read buffer is > > >> > emptied > > >> > > on-commit, and any Iterator created on a ConcurrentSkipListMap > > >> > > automatically reflects those changes, so would terminate iteration > > >> > > prematurely. Also, any new writes to the buffer (put/delete) would > > be > > >> > > immediately reflected in the iterator, making its results > > >> inconsistent. > > >> > > TL;DR: copying it gives us full snapshot isolation for IQ > iterators. > > >> > > FWIW, the implementation already takes an Iterator over the > > underlying > > >> > > store, as well as the buffer, and merges the two iterators > together. > > >> But > > >> > > the read buffer's Iterator must be over a copy of the read buffer, > > to > > >> > > provide a consistent snapshot of the store. > > >> > > > > >> > > BB2: Basically, the RocksDBTransactionBuffer will maintain an > extra > > >> > "write > > >> > > buffer", which uses a RocksDB WriteBatch to stage writes during > > >> > > put/delete. This yields better throughput than iterating the > > >> > > ConcurrentSkipListMap and writing to a WriteBatch during commit, > and > > >> > > reduces the time that IQ threads (that need a readLock on the > > >> > > ConcurrentSkipListMap) are blocked from creating iterators during > > >> commit. > > >> > > > > >> > > InMemoryTransactionBuffers do not need this extra "write buffer", > as > > >> > there > > >> > > is no more efficient implementation for them to use. This means > > that, > > >> vs. > > >> > > RocksDB, InMemoryStores will actually use less memory buffering > > >> records. > > >> > > > > >> > > BB3: This is a good question, and one that I've not really dug too > > >> deeply > > >> > > into. I suspect that there's a level of duplication of concerns > > here, > > >> but > > >> > > there are subtle differences: the cache is an LRU cache, which can > > >> cache > > >> > > records across multiple commits (if they're frequently read, for > > >> > example), > > >> > > whereas the transaction buffer always exclusively contains > > uncommitted > > >> > > writes. > > >> > > > > >> > > In terms of correctness/consistency, I don't think we have a > > problem. > > >> For > > >> > > READ_COMMITTED IQ, we skip both the cache and transaction buffers. > > For > > >> > > READ_UNCOMMITTED IQ (and StreamThread reads) we read in the order > > >> > dictated > > >> > > by the StateStore layering, which is: cache first, then > transaction > > >> > buffer, > > >> > > then underlying store. Since writes always update the cache, this > is > > >> > > correct. The one wrinkle is that cached writes are only written to > > >> > > downstream store layers (and therefore buffered by the transaction > > >> > buffer) > > >> > > when the entry is evicted, or the cache is flushed or committed. > > This > > >> > > essentially double-buffers writes, which is a bit wasteful of > > memory. > > >> > > > > >> > > I don't really know much about the cache, but as far as I can tell > > it > > >> > > might eliminate some (de)serialization overhead? Regardless, given > > >> that > > >> > it > > >> > > implements a different behaviour (LRU vs. buffering uncommitted > > >> writes), > > >> > > eliminating it might cause some subtle performance regressions for > > >> *some* > > >> > > users. > > >> > > > > >> > > Maybe instead we have the *default* cache size change to 0 when > > >> > > enable.transactional.statestores = true and > > >> > > statestore.uncommitted.max.bytes > 0? That would reduce memory > usage > > >> for > > >> > > users that aren't explicitly relying on the cache. > > >> > > > > >> > > BB4: Ahh yes, good catch! This is showing how old this KIP is :-) > > >> > > > > >> > > Regards, > > >> > > Nick > > >> > > > > >> > > On Mon, 20 Apr 2026 at 21:11, Bill Bejeck <[email protected]> > > wrote: > > >> > > > > >> > >> Hi Nick, > > >> > >> > > >> > >> Thanks for the update to the KIP! It's great to see this moving > > >> again. > > >> > >> Overall the KIP updates LGTM, but I do have a couple of comments > > >> > >> > > >> > >> BB1- The KIP specificies using a ConcurrentSkipListMap for the > > >> > >> transactional buffer and for IQ READ_UNCOMMITTED queries KS will > > >> take a > > >> > >> snapshot and during the copy lock the StreamThread with > > >> > >> RenetrantReadWriteLock. > > >> > >> Now if the IQ queries will only be served from the in-memory > > >> > transactional > > >> > >> buffer, that's an acceptable trade-off since even though the > > >> > >> ConcurrentSkipListMap uses weakly consistent iterators, a commit > > >> > >> mid-iteration would mean those records not viewed as a result of > > >> > clearing > > >> > >> the buffer. Would it be worth exploring using IQ against the > store > > >> and > > >> > >> the > > >> > >> buffer thus removing the copy+locking? I think this could be > > >> > >> worthwhile (unless just technically unfeasible) as the way it > > stands > > >> > users > > >> > >> will have to possibly make tradeoffs in performance by adjusting > > >> either > > >> > >> the > > >> > >> commit interval or the size of uncommitted bytes. Of course as > > >> specified > > >> > >> in > > >> > >> the KIP they have the option of going with READ_COMMITTED. > > >> > >> > > >> > >> BB2 - In the KIP section "Transaction Buffer Thread Safety" > there's > > >> this > > >> > >> sentence which I find confusing: > > >> > >> > > >> > >> > Reads from the StreamThread and READ_UNCOMMITTED IQ threads are > > >> > serviced > > >> > >> > by the read buffer. The write buffer is (optionally) used to > > buffer > > >> > >> > writes for commit to the database, if doing so would yield > better > > >> > >> > performance than using the read buffer. > > >> > >> > > >> > >> Can you clarify what this means? > > >> > >> > > >> > >> BB3 - How does the `statestore.uncommitted.max.bytes` work in > > >> > >> conjuction with the `statestore.cache.max.bytes.buffering` > config? > > >> If > > >> > >> users elect to go with transactional stores is the statestore > cache > > >> > still > > >> > >> needed? Since the statestore cache flushes on commit() as well > as > > >> the > > >> > >> transactional cache what's the workflow like between the two? > The > > >> > default > > >> > >> size of the store cache is 10M so if it becomes full and a commit > > is > > >> not > > >> > >> called would it empty into the transactional cache? > > >> > >> > > >> > >> BB4- In discussing the `enable.transactional.statestores` config > it > > >> > >> mentions "exactly-once, exactly-once-v2 or exactly-once-beta" but > > we > > >> can > > >> > >> go > > >> > >> with just `exactly-once-v2` , we removed the others since 4.0 > > >> > >> > > >> > >> Thanks! > > >> > >> Bill > > >> > >> > > >> > >> > > >> > >> On Fri, Apr 17, 2026 at 1:13 PM Nick Telford < > > [email protected] > > >> > > > >> > >> wrote: > > >> > >> > > >> > >> > Hi everyone, > > >> > >> > > > >> > >> > We're circling back to KIP-892, with the goal of including it > in > > >> 4.4. > > >> > To > > >> > >> > that end, I've updated the KIP with substantial design > > >> improvements: > > >> > >> > > > >> > >> > > > >> > >> > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-892:+Transactional+Semantics+for+StateStores > > >> > >> > > > >> > >> > Highlights: > > >> > >> > > > >> > >> > - TransactionBuffers that support both RocksDB and InMemory > > >> stores > > >> > >> > - Support for custom IsolationLevel of Interactive Queries, > > >> > >> including a > > >> > >> > default (IQv1 and IQv2) and a per-request isolation level > > (IQv2 > > >> > only) > > >> > >> > - New store-level metric for uncommitted-bytes > > >> > >> > > > >> > >> > Let me know what you think! > > >> > >> > > > >> > >> > Regards, > > >> > >> > Nick > > >> > >> > > > >> > >> > On Wed, 17 Apr 2024 at 11:50, Nick Telford < > > [email protected] > > >> > > > >> > >> wrote: > > >> > >> > > > >> > >> > > Hi Walker, > > >> > >> > > > > >> > >> > > Feel free to ask away, either on the mailing list of the > > >> Confluent > > >> > >> > > Community Slack, where I hang out :-) > > >> > >> > > > > >> > >> > > The implementation is *mostly* complete, although it needs > some > > >> > >> > polishing. > > >> > >> > > It's worth noting that KIP-1035 is a hard prerequisite for > > this. > > >> > >> > > > > >> > >> > > Regards, > > >> > >> > > Nick > > >> > >> > > > > >> > >> > > > >> > >> > > >> > > > > >> > > > >> > > > > > >
