Re: Iterating BinaryDocValues

Robert Muir Fri, 10 Jan 2014 12:27:15 -0800

what are you doing with the data?


On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:

> I'll provide a little more context. I'm working on bulk extracting
> BinaryDocValues. My initial performance test was with in-memory
> BinaryDocValues, but I think the end game is actually disk-based
> BinaryDocValues.
>
> I was able to perform around 1 million docId->BytesRef lookups per second
> with in-memory BinaryDocValues. Since I need to get the values for multiple
> fields for each document, this bogs down pretty quickly.
>
> I'm wondering if there is a way to increase this throughput. Since filling
> a BytesRef is pretty fast, I was assuming it was the seek that was taking
> the time, but I didn't verify this. The first thing that came to mind is
> iterating the docValues in such a way that the next docValue could be
> loaded without a seek. But I haven't dug into how the BinaryDocValues are
> formatted so I'm not sure if this would help or not. Also there could be
> something else besides the seek that is limiting the throughput.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>
>> Yeah, I don't think it's from newer docvalues-using code like yours, Shai.
>>
>> Instead, the problems I had doing this are historical, because e.g.
>> FieldCache pointed to large arrays and consumers were lazy about it,
>> knowing that their reference pointed to bytes that would remain valid
>> across invocations.
>>
>> We just have to remove these assumptions. I don't apologize for not doing
>> this: as you show, it's a small % improvement (which we should go and get
>> back!), but I went with safety first initially rather than risking bugs.
>>
>>
>>
>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>
>>> I agree with Robert. We should leave cloning BytesRefs to whoever needs
>>> that, and not penalize everyone else who doesn't need it. I must say I
>>> didn't know I could "own" those BytesRefs, and I clone them whenever I
>>> need to. I think I was bitten by one of the other APIs, so I assumed
>>> returned BytesRefs are not "mine" across all the APIs.
>>>
>>> Shai
>>>
>>>
>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>
>>>> The problem is actually simpler to solve.
>>>>
>>>> Look at the comment in the code; it tells you why it is this way:
>>>>
>>>>           // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
>>>>           // assume "they" own the bytes after calling this!
>>>>
>>>> That is what we should fix. There is no need to make bulk APIs or even
>>>> change the public api in any way (other than javadocs).
>>>>
>>>> We just move the cloning out of the codec, and require the consumer
>>>> to do it, same as TermsEnum or other APIs. The codec part is extremely
>>>> simple here; it's even the way I had it initially.
>>>>
>>>> But at the time (and even still now) this comes with some risk of bugs.
>>>> So initially I removed the reuse and went with a more conservative approach
>>>> to start with.
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <[email protected]> wrote:
>>>>
>>>>> Adrien,
>>>>>
>>>>> Please find a bulkGet() scratch. It's ugly copy-paste that just reuses
>>>>> the BytesRef, which provides a 10% gain.
>>>>> ...
>>>>> bulkGet took:101630 ms
>>>>> ...
>>>>> get took:114422 ms
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>
>>>>>> I don't think we should add such a method. Doc values are commonly
>>>>>> read from collectors, so why do we need a method that works on top of
>>>>>> a DocIdSetIterator? I'm also curious how specialized implementations
>>>>>> could make this method faster than the default implementation?
>>>>>>
>>>>>> --
>>>>>> Adrien
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>> Principal Engineer,
>>>>> Grid Dynamics
>>>>>
>>>>> <http://www.griddynamics.com>
>>>>>  <[email protected]>
>>>>>
>>>>
>>>>
>>>
>>
>
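
The ownership question discussed in the quoted thread (a producer reusing one
buffer versus consumers assuming they own the returned bytes) can be
illustrated with a minimal, self-contained sketch. It deliberately does not
use Lucene: Slice, ReusingValues and collect are hypothetical names invented
for this illustration, loosely modeled on the idea of a BytesRef and a
per-document value source. It only shows why a consumer that retains a
returned reference must copy it once the producer shares a scratch buffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these types are Lucene APIs. */
public class SharedBufferSketch {

    /** Minimal byte slice, loosely modeled on the idea of a BytesRef. */
    static final class Slice {
        byte[] bytes = new byte[0];
        int length;

        /** Deep copy, so the caller owns the returned bytes. */
        Slice copy() {
            Slice s = new Slice();
            s.bytes = Arrays.copyOf(bytes, length);
            s.length = length;
            return s;
        }

        String utf8() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
    }

    /** Producer that reuses one scratch buffer across all get() calls. */
    static final class ReusingValues {
        private final byte[][] data;
        private final Slice scratch = new Slice();

        ReusingValues(String... values) {
            data = new byte[values.length][];
            for (int i = 0; i < values.length; i++) {
                data[i] = values[i].getBytes(StandardCharsets.UTF_8);
            }
        }

        int size() {
            return data.length;
        }

        /** The returned slice is only valid until the next call to get(). */
        Slice get(int docId) {
            byte[] src = data[docId];
            if (scratch.bytes.length < src.length) {
                scratch.bytes = new byte[src.length];
            }
            System.arraycopy(src, 0, scratch.bytes, 0, src.length);
            scratch.length = src.length;
            return scratch;
        }
    }

    /** Retains one slice per document, copying only when asked to. */
    static List<String> collect(ReusingValues values, boolean copy) {
        List<Slice> retained = new ArrayList<>();
        for (int doc = 0; doc < values.size(); doc++) {
            Slice s = values.get(doc);
            retained.add(copy ? s.copy() : s);
        }
        List<String> out = new ArrayList<>();
        for (Slice s : retained) {
            out.add(s.utf8());
        }
        return out;
    }

    public static void main(String[] args) {
        ReusingValues values = new ReusingValues("a", "bb", "ccc");
        System.out.println(collect(values, false)); // [ccc, ccc, ccc]
        System.out.println(collect(values, true));  // [a, bb, ccc]
    }
}
```

A consumer that only reads each value before asking for the next one never
needs the copy, which is the case the thread argues should not be penalized.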