[
https://issues.apache.org/jira/browse/LUCENE-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014944#comment-14014944
]
Uwe Schindler commented on LUCENE-5722:
---------------------------------------
bq. Also, if you get unlucky, the .dat could be smallish but span 2 chunks
anyway, e.g. if that segment used CFS format, and then specialization doesn't
kick in, right?
Of course, but on 64 bit operating system the chunk size is 1 GiB, so this will
happen more seldom. It is more likely that you have, as discusses, a very big
file and its completely mapped. I think we should do the improvement, but not
do too much to prevent this multi-case (otheriwse code gets more complicated an
error-prone).
In most cases, for a single segment and one single field the improvement will
kick in often enough. This is one reason why Robert said, we should maybe
investigate to use slice() for accessing the docvalues of a specific field,
instead of a full clone.
There might be another way to improve this to be a singleton, but it is too
hairy: We could do a fresh mapping on the slice. But this would need to also
unmap this fresh slice. And in addition, it consumes additional address space.
One thing that could be done here: If we know in advance, that we never need
the full file, we could mmap only a slice. Maybe we should offer
Directory#openInput(filename, offset, length) which could directly optimize for
the single buffer case??? [don't kill me about this suggestion, was just an
idea].
About the patch: I don't like "singleton" as term, because its closely related
to the pattern "singleton class instance". I would rename the single buffer one
to "SingleByteBufferIndexInput". The method singleton() is fine, I guess, just
the class name.
There is one thing, we might want to add an assert: In the single buffer case,
there is the slight chance to not catch an exception, if the cast from the seek
offset to int luckily gets into the valid slice area. Maybe we should not add a
hard check, but for our own safety while writing the code, we should maybe
check that the long offset is <= Integer.MAX_VALUE.
I like the idea of using ByteBuffer.slice(). Unfortunately (I am so unhappy!),
we cannot use this for the multi-buffer approach (because this would require
then more calculations on clone, which are now optimized to be bitshifts and
{{&}} only).
Also Eclipse warns if you call a static method from a subclass (if properly
configured). ByteBufferIndexInput.newClonesMap() should not be accessed as
MMapIndexInput.newCloneMap()... But thats just cosmetic - although it confused
me, too (I hate Java for allowing to access static methods with a different
class name, we should maybe make the warning a failure in Eclipse compiler
config at least).
In any case, we might improve the multi-buffer seek, too: if we already know
before that we will land in the same buffer - we could maybe do this in a small
check at the begining of the seek method: If we hit the same buffer, just do
curBuffer.position() and spare out the whole other stuff (which does many
assignments and additional checks). I will think about this a bit more...
> Speed up MMapDirectory.seek()
> -----------------------------
>
> Key: LUCENE-5722
> URL: https://issues.apache.org/jira/browse/LUCENE-5722
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-5722.patch
>
>
> For traditional lucene access which is mostly sequential, occasional
> advance(), I think this method gets drowned out in noise.
> But for access like docvalues, its important. Unfortunately seek() is complex
> today because of mapping multiple buffers.
> However, the very common case is that only one map is used for a given clone
> or slice.
> When there is the possibility to use only a single mapped buffer, we should
> instead take advantage of ByteBuffer.slice(), which will adjust the internal
> mmap address and remove the offset calculation. furthermore we don't need the
> shift/mask or even the negative check, as they are then all handled with the
> ByteBuffer api: seek is a one-liner (with try/catch of course to convert
> exceptions).
> This makes docvalues access 20% faster, I havent tested conjunctions or
> anyhting like that.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]