Hey @Adam,

Thanks for sharing your experience with using prefix seek. I did look at
your code for RocksDBPrefixIterator; in fact, I have repurposed that class
itself since it wasn't being used elsewhere. Regarding how I plan to
expose it throughout the state stores, what I have tried to do is add it
as a separate interface. So, basically, it is not at the same level as the
*range* function, so to speak. The reason I did that is that I currently
feel not all state stores are a natural fit for prefix seek. As I mentioned
in the KIP as well, the closest current equivalent could be BulkLoadingStore
(not in terms of functionality but in terms of how it is also not
implemented by all of them). That way I don't need to stub it across all the
state stores and we can implement it only where needed. For example, in the
PR that I have put for reference in the KIP, you can see that I have it
implemented only for RocksDB.
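To make the shape of that separate interface a bit more concrete, here is a rough sketch. The names (PrefixSeekableStore, InMemoryPrefixStore) are mine for illustration, not the actual API proposed in the KIP, and the toy store just uses a sorted map instead of RocksDB:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical opt-in interface: only stores that can naturally support
// prefix scans implement it, so nothing needs to be stubbed elsewhere.
interface PrefixSeekableStore {
    Iterator<Map.Entry<String, String>> prefixScan(String prefix);
}

// A toy sorted-map-backed store showing one possible implementation.
class InMemoryPrefixStore implements PrefixSeekableStore {
    private final TreeMap<String, String> map = new TreeMap<>();

    void put(String key, String value) {
        map.put(key, value);
    }

    @Override
    public Iterator<Map.Entry<String, String>> prefixScan(String prefix) {
        // In sorted order, all keys sharing a prefix form a contiguous run.
        return map.entrySet().stream()
                .filter(e -> e.getKey().startsWith(prefix))
                .iterator();
    }
}
```

The point of the separate interface is that a store which cannot support prefix seek simply doesn't implement it, rather than throwing UnsupportedOperationException from a stub.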

@Guozhang,

Thanks for the feedback. Those are very interesting questions and I will
try my best to answer based upon whatever limited understanding I have
developed so far :)

1) Regarding the usage of useFixedLengthPrefixExtractor, honestly, I hadn't
looked at that config. I did look it up after you pointed it out, and it
seems it's more for hash-based memtables? I may be wrong, though. What I
would say is that the changes I made were not so much from a correctness
standpoint as an attempt to showcase how we could implement these
changes. The idea was that once we see the merit in this approach, we
can add some of the tunings (and I would need your team's assistance there
:D).

2) Regarding the similarity of `RocksDBPrefixIterator` and
`RocksDBRangeIterator`, yes, the implementations look more or less similar,
so in terms of performance they might be comparable. But semantically, they
solve two different use cases. The range seek is useful when we know both
from and to. But for use cases where we want to find keys with a certain
prefix without knowing what the start and end are, prefix seek comes in more
handy. The point I am trying to make is that it can extend the scope of
state stores from just point lookups to speculative queries, whereby users
can search whether a certain pattern exists. I can vouch for this personally
because I wanted to use state stores for one such use case, and since this
option wasn't there, I had to work around it. An equivalent to this could be
the SCAN command in Redis. (Not trying to compare Redis and state stores,
just trying to give some context.)
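For what it's worth, the relationship between the two can be made precise: a prefix scan is equivalent to a range scan from the prefix itself up to (but excluding) the prefix's byte-wise successor. This is roughly the trick the FKJ code uses with range(); the helper below is just my illustration of the idea, not code from that PR:

```java
import java.util.Arrays;

class PrefixRange {
    // Compute the smallest byte array strictly greater than every key that
    // starts with `prefix`: increment the last non-0xFF byte and truncate.
    // Returns null when the prefix is all 0xFF bytes, meaning the range has
    // no upper bound and the scan runs to the end of the store.
    static byte[] successor(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] end = Arrays.copyOf(prefix, i + 1);
                end[i]++;
                return end;
            }
        }
        return null;
    }
}
```

So prefixScan(p) can always be emulated as range(p, successor(p)); the value of a first-class prefix seek API is that the store can exploit prefix-aware structures (like RocksDB's prefix bloom filters) rather than doing a plain range scan.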

Regarding the point on bloom filters, there are certain optimisations
being discussed for prefix seek here:

https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#prefix-vs-whole-key

Again, this isn't something that I have explored fully. Also, on the prefix
seek page of the RocksDB wiki, they mention a prefix iterating technique
called Prefix Bloom Filter.

3) Regarding the question on length of bytes for seek vs. prefix seek, I am
not entirely sure about that scenario. What I have understood is that,
at least for RocksDB, short iterator queries are more performant than
longer ones.

4) Regarding the last question on placing it within Segment, the reason I
didn't do it that way is that I thought we shouldn't tie this feature only
to RocksDB. I agree that I got this idea while reading about RocksDB, but if
we keep it outside the purview of RocksDB and keep it as a pluggable entity,
then a) it remains generic by not being tied to any specific store, and
b) no change is needed at all for any of the other stores which haven't
implemented it.
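One way such a pluggable capability could surface to callers, again sketched with hypothetical names rather than the KIP's actual API, is a runtime capability check, so stores that never opted in need no changes at all:

```java
// Hypothetical marker interface for stores that opt in to prefix scans.
interface PrefixSeekable {
    Iterable<String> keysWithPrefix(String prefix);
}

class CapabilityCheck {
    // Callers probe for the capability instead of every store having to
    // stub out an unsupported method.
    static boolean supportsPrefixScan(Object store) {
        return store instanceof PrefixSeekable;
    }
}
```

A caller would then branch on supportsPrefixScan(store) and fall back to a plain range scan (or reject the query) for stores without the capability.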

I am not sure whether all of the above points make sense, but as I said,
this is based on my limited understanding of the codebase. So, please pardon
any incorrect/illogical statements!

@Sophie,

Thanks for bringing that point up! I have mentioned that PR in the
KIP under a section called Other considerations. Nonetheless, thanks for
pointing it out!

Thanks!
Sagar.


On Thu, May 14, 2020 at 5:17 AM Sophie Blee-Goldman <sop...@confluent.io>
wrote:

> Not to derail this KIP discussion, but to leave a few notes on some of the
> RocksDB points that have come up:
>
> Someone actually merged some long overdue performance improvements to
> the RocksJava implementation (the PR was opened back in 2017! yikes).
> I haven't looked into the prefix seek API closely enough to know how
> relevant
> this particular change is, and they are still improving things, but it
> gives me some
> faith.
>
> There are some pretty promising results reported on the PR:
> https://github.com/facebook/rocksdb/pull/2283#issuecomment-561563037
>
> Regarding the custom comparator, they also recently merged this performance
> <https://github.com/facebook/rocksdb/pull/6252>
> improvement <https://github.com/facebook/rocksdb/pull/6252>. The tl;dr is
> they reduced the slowdown of a custom comparator in Java
> (relative to the native C++) from ~7x to ~5.2x at best. Which is still not
> great, but it
> would be interesting to run our own benchmarks and see how this stacks up.
>
> Of course, these are all new changes and as such will require us to upgrade
> rocks to 6.x which means they have to wait for us to release a 3.0. But
> there's
> some talk about 3.0 coming in the next few releases so consider it food for
> not-so-future thought
>
>
> On Tue, May 12, 2020 at 5:02 PM Adam Bellemare <adam.bellem...@gmail.com>
> wrote:
>
> > Hi Guozhang
> >
> > For clarity, the issues I was running into was not about the actual
> > *prefixSeek* function itself, but about exposing it to the same level of
> > access as the *range* function throughout Kafka Streams. It required a
> lot
> > of changes, and also required that most state stores stub it out since it
> > wasn't clear how they would implement it. It was basically an
> overreaching
> > API change that was easily solved (for the specific prefix-scan in FKJ)
> by
> > simply using *range*. So to be clear, the blockers were predominantly
> > around correctly handling the API changes, nothing to do with the
> > mechanisms of the RocksDB prefix scanning.
> >
> > As for KAFKA-5285 I'll look into it more to see if I can get a better
> > handle on the problem!
> >
> > Hope this helps clear it up.
> >
> > Adam
> >
> >
> > On Tue, May 12, 2020 at 7:16 PM Guozhang Wang <wangg...@gmail.com>
> wrote:
> >
> > > Hello Adam,
> > >
> > > I'm wondering if you can provide a bit more context on the blockers of
> > > using prefixSeek of RocksDB (I saw you have a RocksDBPrefixIterator
> class
> > > but not used anywhere yet)? I'm currently looking at ways to allow some
> > > secondary indices with rocksDB following some existing approaches
> > > from CockroachDB etc so I'm very curious to learn your experience.
> > >
> > > 1) Before considering any secondary indices, a quick thought is that
> for
> > > (key, timeFrom, timeTo) queries, we can easily replace the current
> > > `range()` impl with a `prefixRange()` impl via a prefix iterator;
> though
> > > for (keyFrom, keyTo, timeFrom, timeTo) it is much more complicated
> indeed
> > > and hence existing `range()` impl may still be used.
> > >
> > > 2) Another related issue I've been pondering for a while is
> > > around KAFKA-5285: with the default lexicographic byte comparator, since
> > the
> > > key length varies, the combo (key, window) would have interleaving byte
> > > layouts like:
> > >
> > > AAA0001          (key AAA, timestamp 0001)
> > > AAA00011        (key AAA0, timestamp 0011)
> > > AAA0002          (key AAA, timestamp 0002)
> > >
> > > which is challenging for prefix seeks to work efficiently. Although we
> > can
> > > overwrite the byte-comparator in JNI it is very expensive and the cost
> of
> > > JNI overwhelms its benefits. If you've got some ideas around it please
> > lmk
> > > as well.
> > >
> > > Guozhang
> > >
> > >
> > >
> > >
> > > On Tue, May 12, 2020 at 6:26 AM Adam Bellemare <
> adam.bellem...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi Sagar
> > > >
> > > > I implemented a very similar interface for KIP-213, the foreign-key
> > > joiner.
> > > > We pulled it out of the final implementation and instead used RocksDB
> > > range
> > > > instead. You can see the particular code where we use
> > RocksDB.range(...)
> > > to
> > > > get the same iterator result.
> > > >
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/foreignkeyjoin/ForeignJoinSubscriptionProcessorSupplier.java#L95
> > > >
> > > > We pulled it out because there were numerous awkward acrobatics to
> > > > integrate *prefixSeek()* function into the Kafka Streams code.
> > > Basically, I
> > > > wanted to be able to access *prefixSeek()* the same way I can access
> > > > *range()* for any state store, and in particular use it for storing
> > data
> > > > with a particular foreign key (as per the previous URL). However, I
> > found
> > > > out that it required way too many changes to expose the
> *prefixSeek()*
> > > > functionality while still being able to leverage all the nice Kafka
> > > Streams
> > > > state management + supplier functionality, so we made a decision just
> > to
> > > > stick with *range()* and pull everything else out.
> > > >
> > > > I guess my question here is, how do you anticipate using
> *prefixSeek()*
> > > > within the framework of Kafka Streams, or the Processor API?
> > > >
> > > > Adam
> > > >
> > > >
> > > >
> > > > On Tue, May 12, 2020 at 2:52 AM Sagar <sagarmeansoc...@gmail.com>
> > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I would like to start a discussion on the KIP that I created below
> to
> > > add
> > > > > prefix scan support in State Stores:
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>