what are you doing with the data?
On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:

> I'll provide a little more context. I'm working on bulk extracting
> BinaryDocValues. My initial performance test was with in-memory
> BinaryDocValues, but I think the end game is actually disk-based
> BinaryDocValues.
>
> I was able to perform around 1 million docId->BytesRef lookups per second
> with in-memory BinaryDocValues. Since I need to get the values for multiple
> fields for each document, this bogs down pretty quickly.
>
> I'm wondering if there is a way to increase this throughput. Since filling
> a BytesRef is pretty fast, I was assuming it was the seek that was taking
> the time, but I didn't verify this. The first thing that came to mind is
> iterating the docValues in such a way that the next docValue could be
> loaded without a seek. But I haven't dug into how the BinaryDocValues are
> formatted, so I'm not sure if this would help or not. Also, there could be
> something else besides the seek that is limiting the throughput.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>
>> Yeah, I don't think it's from newer docvalues-using code like yours, Shai.
>>
>> Instead, the problems I had doing this are historical, because e.g.
>> fieldcache pointed to large arrays and consumers were lazy about it,
>> knowing that their reference pointed to bytes that would remain valid
>> across invocations.
>>
>> We just have to remove these assumptions. I don't apologize for not doing
>> this; as you show, it's some small % improvement (which we should go and get
>> back!), but I went with safety first initially rather than bugs.
>>
>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>
>>> I agree with Robert. We should leave cloning BytesRefs to whoever needs
>>> that, and not penalize everyone else who doesn't need it.
>>> I must say I didn't know I could "own" those BytesRefs, and I clone
>>> them whenever I need to. I think I was bitten by one of the other APIs,
>>> so I assumed returned BytesRefs are not "mine" across all the APIs.
>>>
>>> Shai
>>>
>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>
>>>> The problem is really simpler to solve, actually.
>>>>
>>>> Look at the comments in the code; they tell you why it is this way:
>>>>
>>>> // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
>>>> // assume "they" own the bytes after calling this!
>>>>
>>>> That is what we should fix. There is no need to make bulk APIs or even
>>>> change the public API in any way (other than javadocs).
>>>>
>>>> We just move the cloning out of the codec and require the consumer
>>>> to do it, same as TermsEnum or other APIs. The codec part is extremely
>>>> simple here; it's even the way I had it initially.
>>>>
>>>> But at the time (and even still now) this comes with some risk of bugs.
>>>> So initially I removed the reuse and went with a more conservative approach
>>>> to start with.
>>>>
>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <[email protected]> wrote:
>>>>
>>>>> Adrien,
>>>>>
>>>>> Please find a bulkGet() scratch. It's an ugly copy-paste; it just reuses
>>>>> the BytesRef, which provides a 10% gain.
>>>>> ...
>>>>> bulkGet took:101630 ms
>>>>> ...
>>>>> get took:114422 ms
>>>>>
>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>
>>>>>> I don't think we should add such a method. Doc values are commonly
>>>>>> read from collectors, so why do we need a method that works on top of
>>>>>> a DocIdSetIterator? I'm also curious how specialized implementations
>>>>>> could make this method faster than the default implementation.
>>>>>>
>>>>>> --
>>>>>> Adrien
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>> Principal Engineer,
>>>>> Grid Dynamics
>>>>>
>>>>> <http://www.griddynamics.com>
>>>>> <[email protected]>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
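
To make the ownership question in the thread concrete, here is a minimal sketch of what can go wrong when a producer reuses one buffer and a consumer keeps the returned reference without cloning it. This is not Lucene code: Ref, ReusingValues, collectWithoutCopy and collectWithCopy are hypothetical names made up for illustration, and Ref is only a rough stand-in for a BytesRef-like object. It assumes the producer hands back the same scratch object on every call, which is the reuse being discussed above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of buffer reuse vs. consumer-side cloning; not Lucene code. */
public class ReuseSketch {

    /** Rough stand-in for a BytesRef-like view: a byte array plus a valid length. */
    static final class Ref {
        byte[] bytes = new byte[0];
        int length;

        /** Deep copy, so the caller owns the returned bytes. */
        Ref copy() {
            Ref r = new Ref();
            r.bytes = Arrays.copyOf(bytes, length);
            r.length = length;
            return r;
        }

        String utf8() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
    }

    /** Producer that reuses a single scratch Ref across calls (assumed behavior). */
    static final class ReusingValues {
        private final String[] data;
        private final Ref scratch = new Ref();

        ReusingValues(String... data) {
            this.data = data;
        }

        /** Returns the shared scratch Ref; its contents are only valid until the next get(). */
        Ref get(int docId) {
            byte[] src = data[docId].getBytes(StandardCharsets.UTF_8);
            if (scratch.bytes.length < src.length) {
                scratch.bytes = new byte[src.length];
            }
            System.arraycopy(src, 0, scratch.bytes, 0, src.length);
            scratch.length = src.length;
            return scratch;
        }
    }

    /** Consumer that wrongly assumes it owns the returned bytes. */
    static List<String> collectWithoutCopy(ReusingValues values, int maxDoc) {
        List<Ref> kept = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            kept.add(values.get(doc)); // every entry is the same shared object
        }
        List<String> out = new ArrayList<>();
        for (Ref r : kept) {
            out.add(r.utf8());
        }
        return out;
    }

    /** Consumer that clones before retaining, paying the copy only where it is needed. */
    static List<String> collectWithCopy(ReusingValues values, int maxDoc) {
        List<Ref> kept = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            kept.add(values.get(doc).copy());
        }
        List<String> out = new ArrayList<>();
        for (Ref r : kept) {
            out.add(r.utf8());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectWithoutCopy(new ReusingValues("aa", "bb", "cc"), 3)); // [cc, cc, cc]
        System.out.println(collectWithCopy(new ReusingValues("aa", "bb", "cc"), 3));    // [aa, bb, cc]
    }
}
```

The first consumer ends up with three references to the last value, which is the kind of bug the conservative (clone-in-the-codec) approach avoids. The second consumer gets correct results while callers that only look at each value once pay no copy at all.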
