Hey Colt,

I honestly don't know whether a KIP would be strictly required for this, but let me offer an alternative to what you proposed that would both address this particular ambiguity and, to me, provide a better mechanism for guaranteeing this as a public contract of RocksDB in general.

Earlier today someone proposed a KIP (KIP-954) to expand the default DSL store type specification to allow for custom state store implementations. Fundamentally, this involves introducing a new interface that defines a single store type spec, such as RocksDB, in-memory, or whatever else people are plugging in.

I think it's difficult to reason about where to place this guarantee in the existing codebase, but it would be a trivial addition to KIP-954 to include this as a public contract for all RocksDB stores. To me, the assertion that a particular store honors serialized byte ordering is a characteristic of the underlying store type (e.g. RocksDB vs. in-memory), which today is not really exposed anywhere in the public API and is barely a concept internally, making it hard to see where an appropriate place for this would be. However, if we expand the concept of a store type spec in KIP-954 to include things like this, i.e. semantic guarantees about the type itself, we'll have a clean solution to this without the need for (or question of) an additional KIP.

I actually think this idea of a semantic contract as part of the store spec makes sense to add/call out in that KIP anyway, so I'll probably suggest this on the discussion thread regardless of whether it's decided to be the right solution for this particular concern. I like the idea that if/when something like this comes up again in the future, we'll have a clear place to make any changes to the public contract of internal store implementations. Going forward, this should be done as a quick KIP.
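
To make the ordering point concrete, here is a minimal sketch in plain Java with no Kafka or RocksDB dependencies. It rests on a few assumptions: a `TreeMap` with an unsigned bytewise comparator stands in for a store that iterates in serialized-byte order (which is how I understand RocksDB's default comparator to behave), and `serializeInt` is a hypothetical 4-byte big-endian serializer used only for illustration. The class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

class ByteOrderDemo {
    // Hypothetical serializer: 4-byte big-endian encoding of an int.
    static byte[] serializeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Returns the keys in the order a byte-ordered store would iterate them.
    // The TreeMap is only a stand-in for such a store, not a real state store.
    static List<Integer> scanOrder(List<Integer> keys) {
        // Unsigned lexicographic comparison of the serialized bytes.
        Comparator<byte[]> bytewise = Arrays::compareUnsigned;
        TreeMap<byte[], Integer> store = new TreeMap<>(bytewise);
        for (int key : keys) {
            store.put(serializeInt(key), key);
        }
        return new ArrayList<>(store.values());
    }

    public static void main(String[] args) {
        List<Integer> keys = List.of(-1, 0, 1, 256);
        // Natural (compareTo) order would be -1, 0, 1, 256.
        // Byte order puts -1 last, because 0xFFFFFFFF sorts after every
        // non-negative value when bytes are compared as unsigned.
        System.out.println(scanOrder(keys)); // prints [0, 1, 256, -1]
    }
}
```

The iteration order is deterministic but follows the serialized bytes rather than `compareTo()`, which is exactly the kind of per-store-type guarantee a store spec could document. A hash-based store would offer no such guarantee.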

Let me know what you think, and feel free to add your thoughts on this (and/or the proposal itself) over on the KIP-954 [DISCUSS] thread.

Cheers

On Thu, Jul 20, 2023 at 8:51 PM Colt McNealy <c...@littlehorse.io> wrote:

> Hi all,
>
> The [current documentation](https://kafka.apache.org/35/javadoc/org/apache/kafka/streams/state/ReadOnlyKeyValueStore.html#range(K,K))
> for ReadOnlyKeyValueStore#range() states that:
>
> > Order is not guaranteed as bytes lexicographical ordering might not
> > represent key order.
>
> That makes sense: the ordering of two keys inserted via `store.put()`
> as determined by the `compareTo()` method is not what determines the
> ordering in the store; rather, it's the lexicographical comparison of
> the serialized byte[] arrays that matters.
>
> Some observations after playing with it for over a year:
>
> A) The behavior when you open a store for IQ and don't specify a specific
> partition is that (behind the scenes) a store is opened for one partition,
> and when that store is exhausted, the next partition is opened. There are
> no guarantees about which partition is opened in what order. As such, if
> you just System.out.println() all the keys from the iterator, they are not
> ordered properly.
>
> B) WITHIN a partition, such as if you do a .withPartition() when requesting
> the ReadOnlyKeyValueStore, keys are indeed ordered properly according to
> the bytes produced by the key serializer.
>
> We at LittleHorse rely upon that behavior for API pagination, and if that
> behavior were to change it would break some things.
>
> After some digging, it turns out that the reason why we *do* indeed get
> lexicographical ordering of results according to the byte[] array of the
> keys is because that is a public contract exposed by RocksDB.
>
> I had asked Matthias offline if it would be possible to open a PR to
> clarify on the documentation that all results *within a partition of the
> Store* are ordered by the byte[] representation of the key, since I would
> feel more comfortable relying upon a publicly documented API.
>
> However, there are a few counterpoints to this:
>
> - ReadOnlyKeyValueStore is an *interface*, not an implementation. The
> lexicographical ordering is something we observe from the RocksDB
> implementation. If the store were implemented with, for example, a HashMap,
> this would not work.
>
> - The semantics of ordering thus seem to be more associated with the
> *implementation* rather than with the *interface*.
>
> - Is it possible at all to add a clarification on the RocksDB store that
> this behavior is a guarantee? Would that require a KIP?
>
> I'd be super-happy if I could open a PR to put a public documentation note
> somewhere on some implementation of a State Store that documents that this
> ordering by byte[] representation is guaranteed for range scans, but I do
> recognize that making a public documentation note is a contract, and as
> such may require a KIP and/or not be accepted.
>
> Any thoughts?
>
> Thanks for reading,
> Colt McNealy
>
> *Founder, LittleHorse.dev*