[jira] [Comment Edited] (LUCENE-5722) Speed up MMapDirectory.seek()

Uwe Schindler (JIRA) Sun, 01 Jun 2014 04:23:21 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014954#comment-14014954
 ]


Uwe Schindler edited comment on LUCENE-5722 at 6/1/14 11:21 AM:
----------------------------------------------------------------

bq. Sorry Uwe, we shouldn't add extra conditionals to seek here, its the entire 
point of the issue!!!!!!!!

Can you please quote, what you are referring to? Thanks.

Do you mean that one:
bq. There is one thing, we might want to add an assert: In the single buffer 
case, there is the slight chance to not catch an exception, if the cast from 
the seek offset to int luckily gets into the valid slice area. Maybe we should 
not add a hard check, but for our own safety while writing the code, we should 
maybe check that the long offset is <= Integer.MAX_VALUE.

(this is just an assert to catch bugs while testing, I dont want to add a hard 
check.)

or that one:
bq. In any case, we might improve the multi-buffer seek, too: if we already 
know before that we will land in the same buffer - we could maybe do this in a 
small check at the begining of the seek method: If we hit the same buffer, just 
do curBuffer.position() and spare out the whole other stuff (which does many 
assignments and additional checks). I will think about this a bit more...

Don't get me wrong, I was just adding ideas. I know what this issue is about, 
but sometimes one additional check that triggers the "common case" is better 
than having crazy code executed on every call. My idea is basically, we can 
throw it away:

Under the assumption that we seek in most cases not accross byte buffers, the 
overhead of seek could be completely optimized away with one additional check: 
we may add another "state" variable with the (long) start offset of the current 
buffer (curBufOffset - unfortunately, this is one more state field that we need 
to maintain). On seek we maybe check that seek's position parameter minus 
current buffers start offset is positive and less than Integer.MAX_VALUE away - 
this is one check: {{(difference&Integer.MAX_VALUE)==difference}}. If this is 
the case, we can directly optimize to only call position() on the current 
buffer. If any Exception happens we fall-through to our dumb seek (because it 
is not always the EOF case).

Instead of throwing that away because you don't like it, I wanted to do a 
benchmark. Don't just say: "this is bullshit" I try hard to think about an 
optimization of the cross-buffer seek, too. There must be a way to do this! 
Maybe you have another idea instead of mine, but please don't tell me that I 
don't know what I am talking about! In the ideal case I would like to do no 
checks at all and let ByteBuffer throw exceptions if we seek outside the 
current buffer. But this is, as far as I know, impossible because of the cast 
from long to int (it would be nice to have something like the x86 carry-flag 
(http://en.wikipedia.org/wiki/Carry_flag) in Java, too, so you could detect 
overflow while casting long->int)..

We can also maybe move the negative check on the beginning of seek down or do 
it more intelligent?


was (Author: thetaphi):
bq. Sorry Uwe, we shouldn't add extra conditionals to seek here, its the entire 
point of the issue!!!!!!!!

Can you please quote, what you are referring to? Thanks.

Do you mean that one:
bq. There is one thing, we might want to add an assert: In the single buffer 
case, there is the slight chance to not catch an exception, if the cast from 
the seek offset to int luckily gets into the valid slice area. Maybe we should 
not add a hard check, but for our own safety while writing the code, we should 
maybe check that the long offset is <= Integer.MAX_VALUE.

(this is just an assert to catch bugs while testing, I dont want to add a hard 
check.)

or that one:
bq. In any case, we might improve the multi-buffer seek, too: if we already 
know before that we will land in the same buffer - we could maybe do this in a 
small check at the begining of the seek method: If we hit the same buffer, just 
do curBuffer.position() and spare out the whole other stuff (which does many 
assignments and additional checks). I will think about this a bit more...

Don't get me wrong, I was just adding ideas. I know what this issue is about, 
but sometimes one additional check that triggers the "common case" is better 
than having crazy code executed on every call. My idea is basically, we can 
throw it away:

Under the assumption that we seek in most cases not accross byte buffers, the 
overhead of seek could be completely optimized away with one additional check: 
we may add another "state" variable with the (long) start offset of the current 
buffer (curBufOffset - unfortunately, this is one more state field that we need 
to maintain). On seek we maybe check that seek's position parameter minus 
current buffers start offset is positive and less than Integer.MAX_VALUE away - 
this is one check: {{(difference&Integer.MAX_VALUE)==(long)Integer.MAX_VALUE}}. 
If this is the case, we can directly optimize to only call position() on the 
current buffer. If any Exception happens we fall-through to our dumb seek 
(because it is not always the EOF case).

Instead of throwing that away because you don't like it, I wanted to do a 
benchmark. Don't just say: "this is bullshit" I try hard to think about an 
optimization of the cross-buffer seek, too. There must be a way to do this! 
Maybe you have another idea instead of mine, but please don't tell me that I 
don't know what I am talking about! In the ideal case I would like to do no 
checks at all and let ByteBuffer throw exceptions if we seek outside the 
current buffer. But this is, as far as I know, impossible because of the cast 
from long to int (it would be nice to have something like the x86 carry-flag 
(http://en.wikipedia.org/wiki/Carry_flag) in Java, too, so you could detect 
overflow while casting long->int)..

We can also maybe move the negative check on the beginning of seek down or do 
it more intelligent?

> Speed up MMapDirectory.seek()
> -----------------------------
>
>                 Key: LUCENE-5722
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5722
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-5722.patch
>
>
> For traditional lucene access which is mostly sequential, occasional 
> advance(), I think this method gets drowned out in noise.
> But for access like docvalues, its important. Unfortunately seek() is complex 
> today because of mapping multiple buffers.
> However, the very common case is that only one map is used for a given clone 
> or slice.
> When there is the possibility to use only a single mapped buffer, we should 
> instead take advantage of ByteBuffer.slice(), which will adjust the internal 
> mmap address and remove the offset calculation. furthermore we don't need the 
> shift/mask or even the negative check, as they are then all handled with the 
> ByteBuffer api: seek is a one-liner (with try/catch of course to convert 
> exceptions).
> This makes docvalues access 20% faster, I havent tested conjunctions or 
> anyhting like that.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (LUCENE-5722) Speed up MMapDirectory.seek()

Reply via email to