[jira] [Commented] (LUCENE-5722) Speed up MMapDirectory.seek()

Uwe Schindler (JIRA) Sun, 01 Jun 2014 02:35:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014944#comment-14014944
 ]


Uwe Schindler commented on LUCENE-5722:
---------------------------------------

bq. Also, if you get unlucky, the .dat could be smallish but span 2 chunks 
anyway, e.g. if that segment used CFS format, and then specialization doesn't 
kick in, right?

Of course, but on 64 bit operating systems the chunk size is 1 GiB, so this 
will happen less often. It is more likely that you have, as discussed, a very 
big file and it's completely mapped. I think we should do the improvement, but 
not do too much to prevent this multi-buffer case (otherwise the code gets more 
complicated and error-prone).

In most cases, for a single segment and one single field, the improvement will 
kick in often enough. This is one reason why Robert said we should maybe 
investigate using slice() for accessing the docvalues of a specific field, 
instead of a full clone.
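
To make the slice() idea concrete, here is a minimal sketch using plain 
java.nio only. The class and method names (SliceViewSketch, sliceOf, readAt) 
are made up for illustration and are not from the patch. It shows that a 
sliced view starts at index 0, so a seek inside it needs no offset arithmetic:

```java
import java.nio.ByteBuffer;

/** Hypothetical demo class, not part of Lucene or the patch. */
final class SliceViewSketch {

  /** Returns a view of parent bytes [offset, offset + length); index 0 of the view is offset. */
  static ByteBuffer sliceOf(ByteBuffer parent, int offset, int length) {
    ByteBuffer dup = parent.duplicate(); // own position/limit, shared content
    dup.limit(offset + length);
    dup.position(offset);
    return dup.slice();
  }

  /** Positions the view at pos and reads one byte; the buffer does the range check. */
  static byte readAt(ByteBuffer slice, int pos) {
    slice.position(pos); // throws IllegalArgumentException if pos is negative or > limit
    return slice.get();
  }
}
```

A real IndexInput would still have to convert the buffer's exceptions into 
the exceptions its callers expect.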

There might be another way to turn this into the single-buffer case, but it is 
too hairy: we could do a fresh mapping on the slice. But then we would also 
need to unmap this fresh slice, and in addition it consumes extra address 
space. One thing that could be done here: if we know in advance that we never 
need the full file, we could mmap only a slice. Maybe we should offer 
Directory#openInput(filename, offset, length), which could directly optimize 
for the single buffer case??? [don't kill me about this suggestion, was just 
an idea].
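
As a rough illustration of "mmap only a slice", the sketch below maps just a 
sub-range of a file with the standard FileChannel.map(mode, position, size) 
call. PartialMapSketch and mapRange are hypothetical names. This is not the 
proposed Directory#openInput overload, and it ignores unmapping entirely:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical demo class, not part of Lucene. */
final class PartialMapSketch {

  /** Maps only [offset, offset + length) of the file, read-only, as a single buffer. */
  static MappedByteBuffer mapRange(Path file, long offset, long length) throws IOException {
    if (length < 0 || length > Integer.MAX_VALUE) {
      // A single ByteBuffer cannot address more than Integer.MAX_VALUE bytes.
      throw new IllegalArgumentException("range does not fit in one buffer: " + length);
    }
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // The mapping stays valid after the channel is closed.
      return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }
  }
}
```

Index 0 of the returned buffer corresponds to the given file offset, so the 
result behaves like the single-buffer case as long as the range fits in one 
buffer.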

About the patch: I don't like "singleton" as a term, because it's closely 
related to the "singleton class instance" pattern. I would rename the single 
buffer one to "SingleByteBufferIndexInput". The method singleton() is fine, I 
guess; only the class name should change.

There is one thing we might want to add an assert for: in the single buffer 
case, there is a slight chance of not catching an exception if the cast from 
the seek offset to int happens to land in the valid slice area. Maybe we should 
not add a hard check, but for our own safety while writing the code, we should 
maybe assert that the long offset is <= Integer.MAX_VALUE.
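
A small sketch of the problem, with hypothetical names (SeekCastSketch, 
uncheckedCast, fitsInInt): a narrowing cast keeps only the low 32 bits, so an 
offset far past the end can turn into a small, valid-looking position.

```java
/** Hypothetical demo class, not part of Lucene or the patch. */
final class SeekCastSketch {

  /** Plain narrowing cast: silently drops the high 32 bits. */
  static int uncheckedCast(long pos) {
    return (int) pos;
  }

  /** The condition the suggested assert would test before casting. */
  static boolean fitsInInt(long pos) {
    return pos >= 0 && pos <= Integer.MAX_VALUE;
  }

  /** Cast guarded by an assert only, so there is no cost when assertions are disabled. */
  static int assertedCast(long pos) {
    assert fitsInInt(pos) : "seek offset out of int range: " + pos;
    return (int) pos;
  }
}
```

For example, uncheckedCast((1L << 32) + 5) yields 5, which a buffer holding 
more than 5 bytes would accept as a position without complaint.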

I like the idea of using ByteBuffer.slice(). Unfortunately (I am so unhappy!), 
we cannot use this for the multi-buffer approach (because this would then 
require more calculations on clone, which are now optimized to be bitshifts and 
{{&}} only).
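
For reference, the shift-and-mask addressing can be sketched as follows. This 
assumes equally sized power-of-two chunks of 1 GiB, as mentioned above for 64 
bit systems. The class and constant names are illustrative only:

```java
/** Hypothetical demo class, not part of Lucene. */
final class ChunkAddressSketch {
  static final int CHUNK_SIZE_POWER = 30; // assumed: 2^30 bytes = 1 GiB per chunk
  static final long CHUNK_SIZE_MASK = (1L << CHUNK_SIZE_POWER) - 1L;

  /** Index of the chunk (buffer) that contains the absolute position. */
  static int bufferIndex(long pos) {
    return (int) (pos >>> CHUNK_SIZE_POWER);
  }

  /** Offset of the absolute position inside its chunk. */
  static int bufferOffset(long pos) {
    return (int) (pos & CHUNK_SIZE_MASK);
  }
}
```

This only works while every chunk starts at a multiple of the chunk size. A 
sliced first buffer starting at an arbitrary offset would presumably break 
that alignment.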

Also, Eclipse warns if you call a static method via a subclass (if properly 
configured). ByteBufferIndexInput.newClonesMap() should not be accessed as 
MMapIndexInput.newCloneMap()... But that's just cosmetic, although it confused 
me, too (I hate Java for allowing access to static methods through a different 
class name; we should maybe make the warning a failure in the Eclipse compiler 
config at least).

In any case, we might improve the multi-buffer seek, too: if we already know 
beforehand that we will land in the same buffer, we could do a small check at 
the beginning of the seek method: if we hit the same buffer, just call 
curBuffer.position() and skip all the other stuff (which does many assignments 
and additional checks). I will think about this a bit more...
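
A minimal sketch of such a fast path, assuming power-of-two chunks and using 
hypothetical field and class names. It is neither the existing seek 
implementation nor the patch, and exception conversion is left out:

```java
import java.nio.ByteBuffer;

/** Hypothetical demo class, not part of Lucene. */
final class FastPathSeekSketch {
  private final ByteBuffer[] buffers;
  private final int chunkSizePower;
  private final long chunkSizeMask;
  private int curBufIndex;
  private ByteBuffer curBuf;

  FastPathSeekSketch(ByteBuffer[] buffers, int chunkSizePower) {
    this.buffers = buffers;
    this.chunkSizePower = chunkSizePower;
    this.chunkSizeMask = (1L << chunkSizePower) - 1L;
    this.curBufIndex = 0;
    this.curBuf = buffers[0];
  }

  void seek(long pos) {
    if (pos < 0) {
      throw new IllegalArgumentException("negative position: " + pos);
    }
    long bi = pos >>> chunkSizePower;      // kept as long so it cannot wrap
    int off = (int) (pos & chunkSizeMask); // mask result always fits in an int here
    if (bi == curBufIndex) {
      curBuf.position(off);                // fast path: same buffer, nothing else to update
      return;
    }
    if (bi >= buffers.length) {
      throw new IllegalArgumentException("seek past EOF: " + pos);
    }
    ByteBuffer b = buffers[(int) bi];
    b.position(off);                       // position first, so state stays intact on failure
    curBufIndex = (int) bi;
    curBuf = b;
  }

  byte readByte() {
    return curBuf.get();
  }

  int currentBufferIndex() {
    return curBufIndex;
  }
}
```

Whether the extra branch pays off depends on how often consecutive seeks stay 
in the same buffer. That would need measuring.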

> Speed up MMapDirectory.seek()
> -----------------------------
>
>                 Key: LUCENE-5722
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5722
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-5722.patch
>
>
> For traditional Lucene access, which is mostly sequential with an occasional 
> advance(), I think this method gets drowned out in noise.
> But for access like docvalues, it's important. Unfortunately seek() is complex 
> today because of mapping multiple buffers.
> However, the very common case is that only one map is used for a given clone 
> or slice.
> When there is the possibility to use only a single mapped buffer, we should 
> instead take advantage of ByteBuffer.slice(), which will adjust the internal 
> mmap address and remove the offset calculation. Furthermore, we don't need the 
> shift/mask or even the negative check, as they are then all handled by the 
> ByteBuffer API: seek is a one-liner (with try/catch of course to convert 
> exceptions).
> This makes docvalues access 20% faster; I haven't tested conjunctions or 
> anything like that.



