I'll provide a little more context. I'm working on bulk extracting BinaryDocValues. My initial performance test was with in-memory BinaryDocValues, but I think the end game is actually disk-based BinaryDocValues.
I was able to perform around 1 million docId->BytesRef lookups per second with in-memory BinaryDocValues. Since I need to get the values for multiple fields for each document, this bogs down pretty quickly, so I'm wondering if there is a way to increase this throughput.

Since filling a BytesRef is pretty fast, I was assuming it was the seek that was taking the time, but I didn't verify this. The first thing that came to mind is iterating the docValues in such a way that the next docValue could be loaded without a seek. But I haven't dug into how the BinaryDocValues are formatted, so I'm not sure whether this would help. There could also be something else besides the seek that is limiting the throughput.

Joel Bernstein
Search Engineer at Heliosearch

On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:

> Yeah, I don't think it's from newer docvalues-using code like yours, Shai.
>
> Instead, the problems I had doing this are historical, because e.g.
> fieldcache pointed to large arrays and consumers were lazy about it,
> knowing that their reference pointed to bytes that would remain valid
> across invocations.
>
> We just have to remove these assumptions. I don't apologize for not doing
> this; as you show, it's some small % improvement (which we should go and get
> back!), but I went with safety first initially rather than bugs.
>
> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>
>> I agree with Robert. We should leave cloning BytesRefs to whoever needs
>> that, and not penalize everyone else who doesn't need it. I must say I didn't
>> know I can "own" those BytesRefs and I clone them whenever I need to. I
>> think I was bitten by one of the other APIs, so I assumed returned
>> BytesRefs are not "mine" across all the APIs.
>>
>> Shai
>>
>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>
>>> The problem is really simpler to solve actually.
>>>
>>> Look at the comments in the code; they tell you why it is this way:
>>>
>>> // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
>>> // assume "they" own the bytes after calling this!
>>>
>>> That is what we should fix. There is no need to make bulk APIs or even
>>> change the public API in any way (other than javadocs).
>>>
>>> We just move the cloning out of the codec, and require the consumer to
>>> do it, same as TermsEnum or other APIs. The codec part is extremely simple
>>> here; it's even the way I had it initially.
>>>
>>> But at the time (and even still now) this comes with some risk of bugs.
>>> So initially I removed the reuse and went with a more conservative approach
>>> to start with.
>>>
>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <[email protected]> wrote:
>>>
>>>> Adrien,
>>>>
>>>> Please find bulkGet() scratch. It's ugly copy-paste; it just reuses the
>>>> BytesRef, which provides a 10% gain.
>>>> ...
>>>> bulkGet took:101630 ms
>>>> ...
>>>> get took:114422 ms
>>>>
>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>
>>>>> I don't think we should add such a method. Doc values are commonly
>>>>> read from collectors, so why do we need a method that works on top of
>>>>> a DocIdSetIterator? I'm also curious how specialized implementations
>>>>> could make this method faster than the default implementation?
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>> Principal Engineer,
>>>> Grid Dynamics
>>>>
>>>> <http://www.griddynamics.com>
>>>> <[email protected]>
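
The contract discussed in the quoted thread (the producer serves every lookup from one reused buffer, and any consumer that wants to keep the bytes makes its own copy) can be illustrated with a small self-contained Java sketch. This is only an illustration under stated assumptions: it does not use Lucene, and Ref, SingleBufferSource, collectWithCopy and collectWithoutCopy are hypothetical stand-ins rather than real Lucene classes or the actual codec code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: none of these types are Lucene classes.
class ScratchReuseSketch {

    /** Hypothetical slice of a byte array (bytes + offset + length). */
    static final class Ref {
        byte[] bytes = new byte[0];
        int offset;
        int length;

        String utf8() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical value source that serves every lookup from ONE shared buffer. */
    static final class SingleBufferSource {
        private final byte[][] stored;
        private byte[] shared = new byte[16];

        SingleBufferSource(String... values) {
            stored = new byte[values.length][];
            for (int i = 0; i < values.length; i++) {
                stored[i] = values[i].getBytes(StandardCharsets.UTF_8);
            }
        }

        int maxDoc() {
            return stored.length;
        }

        /** Points result at the shared buffer; the bytes are valid only until the next call. */
        void get(int docId, Ref result) {
            byte[] value = stored[docId];
            if (shared.length < value.length) {
                shared = new byte[value.length];
            }
            System.arraycopy(value, 0, shared, 0, value.length);
            result.bytes = shared;
            result.offset = 0;
            result.length = value.length;
        }
    }

    /** Consumer that retains values safely by copying the bytes it wants to keep. */
    static List<String> collectWithCopy(SingleBufferSource source) {
        Ref scratch = new Ref(); // one Ref reused for every lookup
        List<byte[]> kept = new ArrayList<>();
        for (int doc = 0; doc < source.maxDoc(); doc++) {
            source.get(doc, scratch);
            kept.add(Arrays.copyOfRange(scratch.bytes, scratch.offset,
                                        scratch.offset + scratch.length));
        }
        List<String> out = new ArrayList<>();
        for (byte[] b : kept) {
            out.add(new String(b, StandardCharsets.UTF_8));
        }
        return out;
    }

    /** Buggy consumer: assumes it "owns" the bytes, but every Ref aliases the shared buffer. */
    static List<String> collectWithoutCopy(SingleBufferSource source) {
        List<Ref> kept = new ArrayList<>();
        for (int doc = 0; doc < source.maxDoc(); doc++) {
            Ref ref = new Ref();
            source.get(doc, ref);
            kept.add(ref);
        }
        List<String> out = new ArrayList<>();
        for (Ref ref : kept) {
            out.add(ref.utf8()); // decoded after the loop, so only the last value survives
        }
        return out;
    }

    public static void main(String[] args) {
        SingleBufferSource source = new SingleBufferSource("aa", "bb", "cc");
        System.out.println(collectWithCopy(source));    // prints [aa, bb, cc]
        System.out.println(collectWithoutCopy(source)); // prints [cc, cc, cc]
    }
}
```

The second consumer shows the kind of bug the thread is worried about: code written against a "caller owns the bytes" assumption silently reads overwritten data once the producer starts reusing its buffer. The first consumer pays for a copy only where a value is actually retained.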
