Hey Colt, I honestly don't really know whether a KIP would be strictly
required for this or not, but let me offer an alternative to what you
proposed that would both address this particular ambiguity and, to me,
provide a better mechanism for guaranteeing this as a public contract of
RocksDB in general.

Earlier today someone proposed a KIP (KIP-954) to expand the default DSL store type
specification to allow for custom state store implementations.
Fundamentally this involves introducing a new interface that basically
defines a single store type spec, such as RocksDB or in-memory, or whatever
other things people are plugging in. I think it's difficult to reason about
where to place this guarantee in the existing codebase, but it would be a
trivial addition to KIP-954 to include this as a public contract for all
RocksDB stores. To me, the assertion that a particular store honors
serialized byte ordering is a characteristic of the underlying store type
(e.g. RocksDB vs. in-memory), which today is not really exposed anywhere in the public
API and barely a concept internally -- making it hard to see where an
appropriate place for this would be. However, if we expand the concept of a
store type spec in KIP-954 to include things like this, i.e. semantic
guarantees about the type itself, we'll have a clean solution to this
without the need or question of an additional KIP.
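
To make that concrete, here is a rough sketch of what a semantic guarantee on a store type spec could look like. Everything in it is hypothetical: `StoreTypeSpec`, `RangeOrdering`, `requireByteOrderedRange` and the two example specs are names I made up for illustration, not the interface proposed in KIP-954 and not anything that exists in Kafka Streams today.

```java
public class StoreSpecSketch {
    /** Hypothetical: ordering guarantees a store type could advertise for range scans. */
    public enum RangeOrdering { SERIALIZED_BYTES_LEXICOGRAPHIC, UNSPECIFIED }

    /** Hypothetical stand-in for a store type spec; NOT the interface from KIP-954. */
    public interface StoreTypeSpec {
        String name();
        RangeOrdering rangeOrdering();
    }

    // Small factory so the example specs below stay readable.
    static StoreTypeSpec spec(String name, RangeOrdering ordering) {
        return new StoreTypeSpec() {
            @Override public String name() { return name; }
            @Override public RangeOrdering rangeOrdering() { return ordering; }
        };
    }

    // A RocksDB-backed type would document byte-ordered range scans as part of its contract.
    public static final StoreTypeSpec ROCKS_DB =
        spec("rocksDB", RangeOrdering.SERIALIZED_BYTES_LEXICOGRAPHIC);

    // A custom store backed by something like a HashMap would make no such promise.
    public static final StoreTypeSpec HASH_BACKED =
        spec("custom-hash-map", RangeOrdering.UNSPECIFIED);

    /** Code that depends on ordering (e.g. pagination) can fail fast on an unsuitable store type. */
    public static void requireByteOrderedRange(StoreTypeSpec spec) {
        if (spec.rangeOrdering() != RangeOrdering.SERIALIZED_BYTES_LEXICOGRAPHIC) {
            throw new IllegalStateException(
                "Store type '" + spec.name() + "' does not guarantee byte-ordered range scans");
        }
    }

    public static void main(String[] args) {
        requireByteOrderedRange(ROCKS_DB);
        System.out.println(ROCKS_DB.name() + " range scans: " + ROCKS_DB.rangeOrdering());
    }
}
```

The point is only that the guarantee lives with the store type rather than with the `ReadOnlyKeyValueStore` interface; the actual shape would be up to the KIP discussion.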

I actually think this idea of a semantic contract as part of the store spec
makes sense to add/call out in that KIP anyways, so I'll probably suggest
this on the discussion thread regardless of whether it's decided to be the
right solution for this particular concern. I like the idea that if/when
something like this comes up again in the future, we'll have a clear place
to make any changes to the public contract of internal store
implementations. Going forward, this should be done as a quick KIP.

Let me know what you think, and feel free to add your thoughts on this
(and/or the proposal itself) over on the KIP-954 [DISCUSS] thread.

Cheers

On Thu, Jul 20, 2023 at 8:51 PM Colt McNealy <c...@littlehorse.io> wrote:

> Hi all,
>
> The [current documentation](https://kafka.apache.org/35/javadoc/org/apache/kafka/streams/state/ReadOnlyKeyValueStore.html#range(K,K))
> for ReadOnlyKeyValueStore#range() states that:
>
> > Order is not guaranteed as bytes lexicographical ordering might not
> > represent key order.
>
> That makes sense: the ordering of the two keys inserted via `store.put()`
> as determined by the `compareTo()` method is not what determines the
> ordering in the store; rather, it's the compareTo() of the serialized
> byte[] array that matters.
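
A small illustration of that difference, as a sketch only: it assumes integer keys serialized as 4 big-endian bytes and a store that sorts keys by unsigned lexicographic byte comparison. The `ByteOrderDemo` class and its helpers are made up for this example and use only the JDK, not Kafka Streams or RocksDB.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ByteOrderDemo {
    // Assumption: keys are encoded as 4 big-endian bytes, standing in for a key serializer.
    static byte[] serialize(int key) {
        return ByteBuffer.allocate(4).putInt(key).array();
    }

    static int deserialize(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    // Returns the keys in the order a byte-ordered store would iterate them:
    // sorted by unsigned lexicographic comparison of the serialized bytes.
    static List<Integer> storeOrder(List<Integer> keys) {
        List<byte[]> serialized = new ArrayList<>();
        for (int key : keys) {
            serialized.add(serialize(key));
        }
        serialized.sort((a, b) -> Arrays.compareUnsigned(a, b));

        List<Integer> ordered = new ArrayList<>();
        for (byte[] bytes : serialized) {
            ordered.add(deserialize(bytes));
        }
        return ordered;
    }

    public static void main(String[] args) {
        // compareTo() order would be [-1, 0, 1, 256]; byte order puts -1 (0xFFFFFFFF) last.
        System.out.println(storeOrder(List.of(1, -1, 0, 256))); // prints [0, 1, 256, -1]
    }
}
```

So the iteration order is stable and well defined, it just follows the serialized form rather than the key type's natural order.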
>
> Some observations after playing with it for over a year:
>
> A) The behavior when you open a store for IQ and don't specify a specific
> partition is that (behind the scenes) a store is opened for one partition,
> and when that store is exhausted, then the next partition is opened. No
> guarantees about which partition is opened in what order. As such, if you
> just System.out.println() all the keys from the iterator, they are not
> ordered properly.
>
> B) WITHIN a partition, such as if you do a .withPartition() when requesting
> the ReadOnlyKeyValueStore, keys are indeed ordered properly according to
> the bytes produced by the key serializer.
>
> We at LittleHorse rely upon that behavior for API pagination, and if that
> behavior were to change it would break some things.
>
> After some digging, it turns out that the reason why we *do* indeed get
> lexicographical ordering of results according to the byte[] array of the
> keys is because that is a public contract exposed by RocksDB.
>
> I had asked Matthias offline if it would be possible to open a PR to
> clarify on the documentation that all results *within a partition of the
> Store* are ordered by the byte[] representation of the key, since I would
> feel more comfortable relying upon a publicly documented API.
>
> However, there are a few counterpoints to this:
>
> - ReadOnlyKeyValueStore is an *interface*, not an implementation. The
> lexicographical ordering is something we observe from the RocksDB
> implementation. If the store were implemented with, for example, a HashMap,
> this would not work.
>
> - The semantics of ordering thus seem to be more associated with the
> *implementation* rather than with the *interface*.
>
> - Is it possible at all to add a clarification on the RocksDB store that
> this behavior is a guarantee? Would that require a KIP?
>
> I'd be super-happy if I could open a PR to put a public documentation note
> somewhere on some implementation of a State Store that documents that this
> ordering by byte[] representation is guaranteed for range scans, but I do
> recognize that making a public documentation note is a contract, and as
> such may require a KIP and/or not be accepted.
>
> Any thoughts?
>
> Thanks for reading,
> Colt McNealy
>
> *Founder, LittleHorse.dev*
>
