Whoops, I guess I didn't finish reading the KIP all the way to the end earlier. Thanks for including the link to the RocksDB PR in the KIP!
I have one additional question about the proposal: do you plan to also add
this prefix seek API to the dual column family iterators? These are used by
RocksDBTimestampedStore (which extends RocksDBStore), for example the
*RocksDBDualCFRangeIterator*. Thanks for the KIP!

On Thu, May 14, 2020 at 10:50 AM Sagar <sagarmeansoc...@gmail.com> wrote:

> Hey @Adam,
>
> Thanks for sharing your experience with using prefix seek. I did look at
> your code for RocksDBPrefixIterator; in fact, I have repurposed that class
> itself since it wasn't being used elsewhere. Regarding how I plan to
> expose this throughout the state stores: I have added it as a separate
> interface, so it is not at the same level as the *range* function, so to
> speak. The reason is that currently I feel not all state stores are a
> natural fit for prefix seek. As I mentioned in the KIP, the closest
> current equivalent is BulkLoadingStore (not in terms of functionality,
> but in terms of how it is also not implemented by all of the stores).
> That way I don't need to stub it out across all the state stores, and we
> can implement it only where needed. For example, in the PR that I have
> put up for reference in the KIP, you can see that I have implemented it
> only for RocksDB.
>
> @Guozhang,
>
> Thanks for the feedback. Those are very interesting questions, and I will
> try my best to answer based upon whatever limited understanding I have
> developed so far :)
>
> 1) Regarding the usage of useFixedLengthPrefixExtractor: honestly, I
> hadn't looked at that config. I did look it up after you pointed it out,
> and it seems it's more for hash-based memtables? I may be wrong though.
> But I would say that the changes I made were not so much about
> correctness as about showcasing how we can implement these changes.
> The idea was that once we see the merit in this approach, we can add some
> of the tunings (and I would need your team's assistance there :D).
>
> 2) Regarding the similarity of `RocksDBPrefixIterator` and
> `RocksDBRangeIterator`: yes, the implementations look more or less
> similar, so in terms of performance they might be comparable. But
> semantically, they solve two different use cases. The range seek is
> useful when we know both from and to. If we consider use cases where we
> want to find keys with a certain prefix, but we don't know what its start
> and end are, then prefix seek comes in more handy. The point I am trying
> to make is that it can extend the scope of state stores from plain point
> lookups to somewhat speculative queries, whereby users can search whether
> a certain pattern exists. I can vouch for this personally because I
> wanted to use state stores for one such use case, and since this option
> wasn't there, I had to do some other things. A rough equivalent is the
> SCAN operator in Redis (not trying to compare Redis and state stores,
> just trying to give some context).
>
> Regarding the point on bloom filters, there are certain optimisations
> being discussed for the prefix seek case here:
>
> https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#prefix-vs-whole-key
>
> Again, this isn't something that I have explored fully. Also, the prefix
> seek page in the RocksDB wiki mentions a prefix iterating technique
> called the Prefix Bloom Filter.
>
> 3) Regarding the question on length of bytes for seek vs. prefix seek, I
> am not entirely sure about that scenario. What I have understood is that,
> at least for RocksDB, it is more performant for short iterator queries
> than for longer ones.
>
> 4) Regarding the last question on placing it within Segment: the reason I
> didn't do it that way is that I thought we shouldn't tie this feature
> only to RocksDB.
> I agree that I got this idea while looking at/reading about RocksDB, but
> if we keep it outside the purview of RocksDB and keep it as a pluggable
> entity, then a) it remains generic by not being tied to any specific
> store, and b) no change is needed at all for any of the other stores
> which haven't implemented it.
>
> I am not sure if any of the above points make sense, but as I said, this
> is based on my limited understanding of the codebase. So, pardon any
> incorrect/illogical statements plz!
>
> @Sophie,
>
> Thanks for bringing that point up! I have mentioned that PR in the KIP
> under a section called Other considerations. Nonetheless, thanks for
> pointing it out!
>
> Thanks!
> Sagar.
>
>
> On Thu, May 14, 2020 at 5:17 AM Sophie Blee-Goldman <sop...@confluent.io>
> wrote:
>
> > Not to derail this KIP discussion, but to leave a few notes on some of
> > the RocksDB points that have come up:
> >
> > Someone actually merged some long overdue performance improvements to
> > the RocksJava implementation (the PR was opened back in 2017! yikes).
> > I haven't looked into the prefix seek API closely enough to know how
> > relevant this particular change is, and they are still improving
> > things, but it gives me some faith.
> >
> > There are some pretty promising results reported on the PR:
> > https://github.com/facebook/rocksdb/pull/2283#issuecomment-561563037
> >
> > Regarding the custom comparator, they also recently merged this
> > performance improvement <https://github.com/facebook/rocksdb/pull/6252>.
> > The tl;dr is that they reduced the slowdown of a custom comparator in
> > Java (relative to the native C++) from ~7x to ~5.2x at best. That is
> > still not great, but it would be interesting to run our own benchmarks
> > and see how this stacks up.
> > Of course, these are all new changes, and as such they will require us
> > to upgrade Rocks to 6.x, which means they have to wait for us to
> > release a 3.0. But there's some talk about 3.0 coming in the next few
> > releases, so consider it food for not-so-future thought.
> >
> >
> > On Tue, May 12, 2020 at 5:02 PM Adam Bellemare <adam.bellem...@gmail.com>
> > wrote:
> >
> > > Hi Guozhang
> > >
> > > For clarity, the issues I was running into were not about the actual
> > > *prefixSeek* function itself, but about exposing it at the same level
> > > of access as the *range* function throughout Kafka Streams. It
> > > required a lot of changes, and also required that most state stores
> > > stub it out since it wasn't clear how they would implement it. It was
> > > basically an overreaching API change that was easily solved (for the
> > > specific prefix scan in FKJ) by simply using *range*. So to be clear,
> > > the blockers were predominantly around correctly handling the API
> > > changes, nothing to do with the mechanics of the RocksDB prefix
> > > scanning.
> > >
> > > As for KAFKA-5285, I'll look into it more to see if I can get a
> > > better handle on the problem!
> > >
> > > Hope this helps clear it up.
> > >
> > > Adam
> > >
> > >
> > > On Tue, May 12, 2020 at 7:16 PM Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > >
> > > > Hello Adam,
> > > >
> > > > I'm wondering if you can provide a bit more context on the blockers
> > > > to using prefixSeek with RocksDB (I saw you have a
> > > > RocksDBPrefixIterator class, but it is not used anywhere yet)? I'm
> > > > currently looking at ways to allow some secondary indices with
> > > > RocksDB, following some existing approaches from CockroachDB etc.,
> > > > so I'm very curious to learn from your experience.
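As background for the exchange in this thread: a prefix seek simply positions an iterator at the first key at or after the prefix, then scans forward only while keys still carry that prefix. Here is a minimal pure-Java sketch of those semantics, with a sorted map standing in for RocksDB (all names here are mine for illustration, not from the KIP or Kafka Streams):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class PrefixSeekSketch {
    // Prefix seek, illustratively: position at the first key >= prefix,
    // then scan forward only while keys still start with the prefix.
    static List<String> scanPrefix(NavigableMap<String, String> store, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String key : store.tailMap(prefix, true).navigableKeySet()) {
            if (!key.startsWith(prefix)) {
                break; // left the prefix region; no need to scan further
            }
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        // The map's natural String ordering plays the role of RocksDB's
        // default lexicographic byte comparator.
        TreeMap<String, String> store = new TreeMap<>();
        store.put("user_1|click", "a");
        store.put("user_1|view", "b");
        store.put("user_2|click", "c");
        store.put("vendor_1|sale", "d");
        System.out.println(scanPrefix(store, "user_1")); // [user_1|click, user_1|view]
    }
}
```

This is the "seek then scan" shape that both `RocksDBRangeIterator` and a prefix iterator share; the difference discussed above is only in how the stopping condition is derived.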
> > > > 1) Before considering any secondary indices, a quick thought is
> > > > that for (key, timeFrom, timeTo) queries, we can easily replace the
> > > > current `range()` impl with a `prefixRange()` impl via a prefix
> > > > iterator; though for (keyFrom, keyTo, timeFrom, timeTo) it is much
> > > > more complicated indeed, and hence the existing `range()` impl may
> > > > still be used.
> > > >
> > > > 2) Another related issue I've been pondering for a while is around
> > > > KAFKA-5285: with the default lexicographic byte comparator, since
> > > > the key length varies, the combo (key, window) would have
> > > > interleaving byte layouts like:
> > > >
> > > > AAA0001 (key AAA, timestamp 0001)
> > > > AAA00011 (key AAA0, timestamp 0011)
> > > > AAA0002 (key AAA, timestamp 0002)
> > > >
> > > > which is challenging for prefix seeks to work efficiently. Although
> > > > we can overwrite the byte comparator in JNI, that is very expensive
> > > > and the cost of JNI overwhelms its benefits. If you've got some
> > > > ideas around it, please lmk as well.
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Tue, May 12, 2020 at 6:26 AM Adam Bellemare
> > > > <adam.bellem...@gmail.com> wrote:
> > > >
> > > > > Hi Sagar
> > > > >
> > > > > I implemented a very similar interface for KIP-213, the
> > > > > foreign-key joiner. We pulled it out of the final implementation
> > > > > and instead used RocksDB range. You can see the particular code
> > > > > where we use RocksDB.range(...) to get the same iterator result.
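Guozhang's interleaving layouts above can be reproduced directly: sorting the three encoded entries under a plain lexicographic comparator puts the entry for key "AAA0" between the two entries for key "AAA", which is exactly why a prefix seek on "AAA" cannot cleanly isolate a single key. A small pure-Java check (the `encode` helper is mine, mirroring the variable-length-key-plus-timestamp layout described in KAFKA-5285):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InterleavingSketch {
    // Naive layout from the thread: variable-length key immediately
    // followed by a fixed-width timestamp, compared lexicographically.
    static String encode(String key, String timestamp) {
        return key + timestamp;
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>(Arrays.asList(
                encode("AAA", "0001"),
                encode("AAA0", "0011"),
                encode("AAA", "0002")));
        entries.sort(null); // natural order, like RocksDB's default comparator

        // The key-"AAA0" entry lands between the two key-"AAA" entries:
        System.out.println(entries); // [AAA0001, AAA00011, AAA0002]

        // Worse, every encoding of key "AAA0" also matches prefix "AAA",
        // so a prefix seek on "AAA" picks up the wrong key's windows.
        System.out.println("AAA00011".startsWith("AAA")); // true
    }
}
```

Fixed-length keys (or a length-framed key encoding) would remove the interleaving, which is presumably why `useFixedLengthPrefixExtractor`-style configs came up earlier in the thread.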
> > > > > https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/foreignkeyjoin/ForeignJoinSubscriptionProcessorSupplier.java#L95
> > > > >
> > > > > We pulled it out because there were numerous awkward acrobatics
> > > > > needed to integrate the *prefixSeek()* function into the Kafka
> > > > > Streams code. Basically, I wanted to be able to access
> > > > > *prefixSeek()* the same way I can access *range()* for any state
> > > > > store, and in particular use it for storing data with a
> > > > > particular foreign key (as per the previous URL). However, I
> > > > > found out that it required way too many changes to expose the
> > > > > *prefixSeek()* functionality while still being able to leverage
> > > > > all the nice Kafka Streams state management + supplier
> > > > > functionality, so we made the decision just to stick with
> > > > > *range()* and pull everything else out.
> > > > >
> > > > > I guess my question here is: how do you anticipate using
> > > > > *prefixSeek()* within the framework of Kafka Streams, or the
> > > > > Processor API?
> > > > >
> > > > > Adam
> > > > >
> > > > >
> > > > > On Tue, May 12, 2020 at 2:52 AM Sagar <sagarmeansoc...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I would like to start a discussion on the KIP that I created
> > > > > > below to add prefix scan support in State Stores:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores
> > > > > >
> > > > > > Thanks!
> > > > > > Sagar.
> > > > > >
> > > >
> > > > --
> > > > -- Guozhang
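For anyone wondering how the *range()*-based workaround Adam describes can cover exactly one prefix: the usual trick is to derive an exclusive upper bound by incrementing the last non-0xFF byte of the prefix and truncating there. A rough sketch of that bound computation (my own names; this is the general technique, not the actual KIP-213 code):

```java
import java.util.Arrays;

public class PrefixToRangeSketch {
    // Smallest byte string strictly greater than every string that starts
    // with `prefix`: increment the last byte that is not 0xFF, then
    // truncate after it. Returns null when no bound exists (the prefix is
    // all 0xFF bytes), meaning the scan must run to the end of the store.
    static byte[] upperBoundFor(byte[] prefix) {
        byte[] bound = Arrays.copyOf(prefix, prefix.length);
        for (int i = bound.length - 1; i >= 0; i--) {
            if (bound[i] != (byte) 0xFF) {
                bound[i]++;
                return Arrays.copyOf(bound, i + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] prefix = {'f', 'o', 'o'};
        byte[] to = upperBoundFor(prefix);
        System.out.println(new String(to)); // fop
        // range(from = "foo", to = "fop"), with `to` treated as exclusive,
        // then returns exactly the keys carrying the prefix "foo".
    }
}
```

Whether the store's range bound is inclusive or exclusive matters here; an inclusive upper bound would need a final check that the returned key still starts with the prefix.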