Hi Mikhail,

These new byte[] are done on purpose because the contract on the
BytesRef class is that it is only a reference to bytes stored
somewhere, not a buffer. So it is always legal to change offset,
length and the bytes reference, but the content of the byte[] array
may only be changed if you own the BytesRef object. So for example, it
is legal for a BytesRefIterator to use a private BytesRef as a mutable
buffer and to expose it in calls to next() but it is illegal to modify
the content of the byte[] in BinaryDocValues.get since the bytes are
provided by consumers of the API.

If we would like to be able to remove these byte[] creations, I think
we would either need:
 1. to use a byte buffer instead of a BytesRef object,
 2. or to make the BytesRef objects private to the BinaryDocValues
instances and have a "BytesRef get(int docId)" method instead of "void
get(int docId, BytesRef ref)"

Although option 1 would avoid these byte[] creations, it would make
the paged-bytes-based implementations a bit slower (eg.
MemoryDocValuesFormat exposes binary doc values through paged bytes)
because they would need to copy bytes instead of just changing
pointers. With option 2, the BinaryDocValues object wouldn't be
thread-safe by nature anymore, but I don't think this is an issue as
we already store them in thread-locals.


On Thu, Jan 9, 2014 at 5:27 PM, Mikhail Khludnev
<[email protected]> wrote:
> Don't you think it's worth to raise a jira regarding those 'new bytes[]' ?
> I'm able to provide a patch if you wish.
>
>
> On Wed, Jan 8, 2014 at 2:02 PM, Mikhail Khludnev
> <[email protected]> wrote:
>>
>> FWIW,
>>
>> Micro benchmark shows 4% gain on reusing incoming ByteRef.bytes in short
>> binary docvalues Test2BBinaryDocValues.testVariableBinary() with mmap
>> directory.
>> I wonder why it doesn't reads into incoming bytes
>> https://github.com/apache/lucene-solr/blame/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java#L401
>>
>>
>>
>> On Wed, Jan 8, 2014 at 12:53 AM, Michael McCandless
>> <[email protected]> wrote:
>>>
>>> Going sequentially should help, if the pages are not hot (in the OS's IO
>>> cache).
>>>
>>> You can also use a different DVFormat, e.g. Direct, but this holds all
>>> bytes in RAM.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Tue, Jan 7, 2014 at 1:09 PM, Mikhail Khludnev
>>> <[email protected]> wrote:
>>> > Joel,
>>> >
>>> > I tried to hack it straightforwardly, but found no free gain there. The
>>> > only
>>> > attempt I can suggest is to try to reuse bytes in
>>> >
>>> > https://github.com/apache/lucene-solr/blame/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java#L401
>>> > right now it allocates bytes every time, which beside of GC can also
>>> > impact
>>> > memory access locality. Could you try fix memory waste and repeat
>>> > performance test?
>>> >
>>> > Have a good hack!
>>> >
>>> >
>>> > On Mon, Dec 23, 2013 at 9:51 PM, Joel Bernstein <[email protected]>
>>> > wrote:
>>> >>
>>> >>
>>> >> Hi,
>>> >>
>>> >> I'm looking for a faster way to perform large scale docId -> bytesRef
>>> >> lookups for BinaryDocValues.
>>> >>
>>> >> I'm finding that I can't get the performance that I need from the
>>> >> random
>>> >> access seek in the BinaryDocValues interface.
>>> >>
>>> >> I'm wondering if sequentially scanning the docValues would be a faster
>>> >> approach. I have a BitSet of matching docs, so if I sequentially moved
>>> >> through the docValues I could test each one against that bitset.
>>> >>
>>> >> Wondering if that approach would be faster for bulk extracts and how
>>> >> tricky it would be to add an iterator to the BinaryDocValues
>>> >> interface?
>>> >>
>>> >> Thanks,
>>> >> Joel
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Sincerely yours
>>> > Mikhail Khludnev
>>> > Principal Engineer,
>>> > Grid Dynamics
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>



-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to