What are you storing in the binaryDV?

On Fri, Jan 10, 2014 at 3:44 PM, Joel Bernstein <[email protected]> wrote:

> For the test I ran, I just timed the number of docId->bytesRef lookups I
> could do in a second.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Fri, Jan 10, 2014 at 3:41 PM, Robert Muir <[email protected]> wrote:
>
>> Are you sure it's not the wire serialization causing the bottleneck
>> (e.g. converting to a UTF-8 string and back, network traffic, JSON
>> encoding, etc.)?
>>
>>
>> On Fri, Jan 10, 2014 at 3:39 PM, Joel Bernstein <[email protected]> wrote:
>>
>>> Bulk extracting full unsorted result sets from Solr. You give Solr a
>>> query and it dumps the full result in a single call. The result set
>>> streaming is in place, but throughput is not as good as I would like.
>>>
>>> Joel Bernstein
>>> Search Engineer at Heliosearch
>>>
>>>
>>> On Fri, Jan 10, 2014 at 3:24 PM, Robert Muir <[email protected]> wrote:
>>>
>>>> What are you doing with the data?
>>>>
>>>>
>>>> On Fri, Jan 10, 2014 at 3:23 PM, Joel Bernstein <[email protected]> wrote:
>>>>
>>>>> I'll provide a little more context. I'm working on bulk extracting
>>>>> BinaryDocValues. My initial performance test was with in-memory
>>>>> binaryDocValues, but I think the end game is actually disk-based
>>>>> binaryDocValues.
>>>>>
>>>>> I was able to perform around 1 million docId->BytesRef lookups per
>>>>> second with in-memory BinaryDocValues. Since I need to get the values
>>>>> for multiple fields for each document, this bogs down pretty quickly.
>>>>>
>>>>> I'm wondering if there is a way to increase this throughput. Since
>>>>> filling a BytesRef is pretty fast, I was assuming it was the seek that
>>>>> was taking the time, but I didn't verify this. The first thing that came
>>>>> to mind is iterating the docValues in such a way that the next docValue
>>>>> could be loaded without a seek. But I haven't dug into how the
>>>>> BinaryDocValues are formatted, so I'm not sure whether this would help.
>>>>> There could also be something else besides the seek limiting the
>>>>> throughput.
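The access pattern described above can be sketched with a simplified, self-contained stand-in. To be clear, this is not Lucene's actual BinaryDocValues codec; the class and storage layout here are hypothetical, chosen only to illustrate the idea of seek-free sequential iteration plus buffer reuse:

```java
// A minimal, self-contained stand-in (NOT Lucene's actual codec; the class
// and layout here are hypothetical) for the access pattern above: all values
// packed into one byte[] with an offsets array, so iterating docIDs in order
// reads the packed array sequentially (no backward seek), and one reused
// scratch buffer avoids a per-lookup allocation.
import java.nio.charset.StandardCharsets;

public class PackedBinaryValues {
    private final byte[] data;    // all values, concatenated in docID order
    private final int[] offsets;  // offsets[i]..offsets[i+1] bounds doc i's value

    public PackedBinaryValues(byte[][] perDocValues) {
        offsets = new int[perDocValues.length + 1];
        int total = 0;
        for (int i = 0; i < perDocValues.length; i++) {
            offsets[i] = total;
            total += perDocValues[i].length;
        }
        offsets[perDocValues.length] = total;
        data = new byte[total];
        for (int i = 0; i < perDocValues.length; i++) {
            System.arraycopy(perDocValues[i], 0, data, offsets[i], perDocValues[i].length);
        }
    }

    /** Fills {@code scratch} with doc {@code docID}'s value; returns its length. */
    public int get(int docID, byte[] scratch) {
        int start = offsets[docID];
        int len = offsets[docID + 1] - start;
        System.arraycopy(data, start, scratch, 0, len);
        return len;
    }

    public static void main(String[] args) {
        byte[][] docs = {
            "alpha".getBytes(StandardCharsets.UTF_8),
            "beta".getBytes(StandardCharsets.UTF_8),
            "gamma".getBytes(StandardCharsets.UTF_8),
        };
        PackedBinaryValues dv = new PackedBinaryValues(docs);
        byte[] scratch = new byte[16]; // reused across every lookup
        for (int doc = 0; doc < docs.length; doc++) {
            int len = dv.get(doc, scratch);
            System.out.println(new String(scratch, 0, len, StandardCharsets.UTF_8));
        }
    }
}
```

With a layout like this, ascending-docID iteration touches `data` strictly left to right, which is the property that would make "load the next docValue without a seek" possible; whether the real on-disk format permits it is exactly the open question above.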
>>>>>
>>>>> Joel Bernstein
>>>>> Search Engineer at Heliosearch
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[email protected]> wrote:
>>>>>
>>>>>> Yeah, I don't think it's from newer docvalues-using code like yours,
>>>>>> Shai.
>>>>>>
>>>>>> Instead, the problems I had doing this are historical: e.g. fieldcache
>>>>>> pointed to large arrays, and consumers were lazy about it, knowing that
>>>>>> their reference pointed to bytes that would remain valid across
>>>>>> invocations.
>>>>>>
>>>>>> We just have to remove these assumptions. I don't apologize for not
>>>>>> doing this initially: as you show, it's some small % improvement (which
>>>>>> we should go and get back!), but I went with safety first rather than
>>>>>> bugs.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[email protected]> wrote:
>>>>>>
>>>>>>> I agree with Robert. We should leave cloning BytesRefs to whoever
>>>>>>> needs that, and not penalize everyone else who doesn't need it. I must
>>>>>>> say I didn't know I could "own" those BytesRefs, and I clone them
>>>>>>> whenever I need to. I think I was bitten by one of the other APIs, so
>>>>>>> I assumed returned BytesRefs are not "mine" across all the APIs.
>>>>>>>
>>>>>>> Shai
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[email protected]> wrote:
>>>>>>>
>>>>>>>> The problem is actually simpler to solve.
>>>>>>>>
>>>>>>>> Look at the comments in the code; they tell you why it is this way:
>>>>>>>>
>>>>>>>>           // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
>>>>>>>>           // assume "they" own the bytes after calling this!
>>>>>>>>
>>>>>>>> That is what we should fix. There is no need to add bulk APIs or
>>>>>>>> even change the public API in any way (other than javadocs).
>>>>>>>>
>>>>>>>> We just move the cloning out of the codec and require the consumer
>>>>>>>> to do it, same as TermsEnum or other APIs. The codec part is
>>>>>>>> extremely simple here; it's even the way I had it initially.
>>>>>>>>
>>>>>>>> But at the time (and even still now) this comes with some risk of
>>>>>>>> bugs. So initially I removed the reuse and went with a more
>>>>>>>> conservative approach to start with.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Adrien,
>>>>>>>>>
>>>>>>>>> Please find the bulkGet() scratch below. It's an ugly copy-paste
>>>>>>>>> that just reuses the BytesRef, which provides a 10% gain.
>>>>>>>>> ...
>>>>>>>>> bulkGet took: 101630 ms
>>>>>>>>> ...
>>>>>>>>> get took: 114422 ms
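A rough sketch of what such a bulkGet() could look like (hypothetical signature and class, not Mikhail's actual patch): one scratch buffer reused across a whole batch of docID lookups, contrasted with a per-call get() that clones every time:

```java
// Hypothetical sketch of the bulkGet() idea above, NOT the actual patch:
// the bulk path reuses one scratch buffer for the whole batch, while the
// plain path clones per call (as the clone-in-codec approach does).
import java.nio.charset.StandardCharsets;
import java.util.function.ObjIntConsumer;

public class BulkGetSketch {
    private final byte[][] values;
    private byte[] scratch = new byte[0]; // reused across the whole batch

    public BulkGetSketch(byte[][] values) { this.values = values; }

    /** Per-call style: allocates a fresh copy on every lookup. */
    public byte[] get(int docID) {
        return values[docID].clone();
    }

    /** Bulk style: the callback sees a shared buffer, valid only during the call. */
    public void bulkGet(int[] docIDs, ObjIntConsumer<byte[]> consumer) {
        for (int docID : docIDs) {
            byte[] v = values[docID];
            if (scratch.length < v.length) scratch = new byte[v.length];
            System.arraycopy(v, 0, scratch, 0, v.length);
            consumer.accept(scratch, v.length); // (buffer, valid length)
        }
    }

    public static void main(String[] args) {
        byte[][] vals = { "a".getBytes(StandardCharsets.UTF_8),
                          "bb".getBytes(StandardCharsets.UTF_8) };
        BulkGetSketch dv = new BulkGetSketch(vals);
        StringBuilder sb = new StringBuilder();
        dv.bulkGet(new int[] {0, 1, 0}, (buf, len) ->
            sb.append(new String(buf, 0, len, StandardCharsets.UTF_8)).append(' '));
        System.out.println(sb.toString().trim()); // a bb a
    }
}
```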
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I don't think we should add such a method. Doc values are commonly
>>>>>>>>>> read from collectors, so why do we need a method that works on top
>>>>>>>>>> of a DocIdSetIterator? I'm also curious how specialized
>>>>>>>>>> implementations could make this method faster than the default
>>>>>>>>>> implementation.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Adrien
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sincerely yours
>>>>>>>>> Mikhail Khludnev
>>>>>>>>> Principal Engineer,
>>>>>>>>> Grid Dynamics
>>>>>>>>>
>>>>>>>>> <http://www.griddynamics.com>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
