But why is it so costly?

In a regular query we walk postings and match document numbers; in a SpanQuery 
we match position numbers (or position ranges). What's the principal 
difference?
I think it's just that #documents << #positions.

For "A,sg" and "A,pl" I use unordered SpanNearQueries with slop=-1. I wrap 
them in an ordered SpanNearQuery with slop=0.
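Roughly, the construction looks like this (a sketch against the Lucene 4.x 
span API; the field name "grammar" is taken from my setup below, and the 
unordered slop=-1 trick is what forces the two feature tokens to overlap at 
the same position):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// "A" and "sg" at the same position: unordered, slop = -1
SpanQuery aSg = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("grammar", "A")),
        new SpanTermQuery(new Term("grammar", "sg"))
    }, -1, false);

// "N" and "sg" at the same position
SpanQuery nSg = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("grammar", "N")),
        new SpanTermQuery(new Term("grammar", "sg"))
    }, -1, false);

// "A,sg" immediately followed by "N,sg": ordered, slop = 0
SpanQuery adjNoun = new SpanNearQuery(new SpanQuery[] { aSg, nSg }, 0, true);
```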

I see getPayload() at the top of the profiler output. I think I can emulate 
payload checking with cleverly assigned position increments (the maximum 
position in a document might then jump up to ~10^9 - I hope it won't blow up 
the whole index).
If I remove payload matching and keep only position checking, will it speed 
everything up, or do positions and payloads cost the same?
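For the record, the encoding I have in mind is something like this (a 
hypothetical scheme, not what's in the index now): give every word a block of 
MAX_PARSES positions, so a feature token belonging to parse p of word w lands 
at position w * MAX_PARSES + p. Then "same parse of the same word" becomes 
"same position", with no payload lookup needed:

```java
// Hypothetical position encoding that would replace the parse bitmask
// payloads. Each word w reserves a block of MAX_PARSES positions; a feature
// token that belongs to parse p of word w is indexed at position
// w * MAX_PARSES + p.
public class ParsePositions {
    static final int MAX_PARSES = 4;

    static int encode(int wordIndex, int parseIndex) {
        return wordIndex * MAX_PARSES + parseIndex;
    }

    // Two feature tokens belong to the same parse of the same word
    // iff their encoded positions coincide.
    static boolean sameWordSameParse(int posA, int posB) {
        return posA == posB;
    }

    // Tokens with the same parse index on adjacent words differ by
    // exactly MAX_PARSES.
    static boolean adjacentWordsSameParse(int posA, int posB) {
        return posB - posA == MAX_PARSES;
    }
}
```

One catch: a feature shared by several parses (bitmask 0011, say) would have 
to be indexed once per parse, which multiplies the number of positions - 
hence the ~10^9 worry above.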

My main goal is getting precise results for a query, so proximity boosting 
won't help, unfortunately.


-- 
Best Regards,
Igor

18.10.2013, 23:37, "Michael McCandless" <luc...@mikemccandless.com>:
> Unfortunately, SpanNearQuery is a very costly query.  What slop are you 
> passing?
>
> You might want to check out
> https://issues.apache.org/jira/browse/LUCENE-5288 ... it adds
> proximity boosting to queries, but it's still very early in the
> iterating, and if you need a precise count of only those documents
> matching the SpanNearQuery, then that issue won't help.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Oct 17, 2013 at 6:05 PM, Igor Shalyminov
> <ishalymi...@yandex-team.ru> wrote:
>
>>  Mike,
>>
>>  For now I'm running just a SpanQuery over a ~600MB index segment 
>> single-threadedly (one segment per thread; the complete setup is 30 
>> segments with a total of 20GB).
>>
>>  I'm trying to use Lucene for a morphologically annotated text corpus 
>> (namely, the Russian National Corpus).
>>  The main query type there is a search for co-occurrences of words with 
>> desired morphological features and a given distance between tokens.
>>
>>  In my test case I work with a single field - grammar (it is word-level: 
>> every word in the corpus has one). The full grammar annotation of a word is 
>> a set of atomic grammar features.
>>  For example, the verb "book" has in its grammar:
>>  - POS tag (V);
>>  - tense (pres);
>>
>>  and the noun "book":
>>  - POS tag (N)
>>  - number (sg).
>>
>>  In general one grammar annotation has approximately 8 atomic features.
>>
>>  Words are treated as initially ambiguous, so for an occurrence of the 
>> word "book" in the text we get the grammar tokens:
>>  V    pres    N    sg
>>  2 parses: "V,pres" and "N,sg" are indexed as independent tokens with 
>> positionIncrement=0.
>>
>>  Moreover, each such token has parse bitmask in its payload:
>>  V|0001    pres|0001    N|0010    sg|0010
>>
>>  Here, V and pres appear in the 1st parse and N and sg in the 2nd, with a 
>> maximum of 4 parse variants. This allows me to find the word "book" for the 
>> query "V" & "pres" but not for the query "V" & "sg".
>>
>>  So, I'm performing a SpanNearQuery {"A,sg" right before "N,sg"} with 
>> position and payload checking over a 600MB segment and getting the precise 
>> number of doc hits and the overall number of matches by iterating over 
>> getSpans().
>>
>>  This takes me about 20 seconds, even if everything is in RAM.
>>  The next thing I'm going to explore is compression; I'll try 
>> DirectPostingsFormat as you suggested.
>>
>>  --
>>  Best Regards,
>>  Igor
>>
>>  17.10.2013, 20:26, "Michael McCandless" <luc...@mikemccandless.com>:
>>>  DirectPostingsFormat holds all postings in RAM, uncompressed, as
>>>  simple java arrays.  But it's quite RAM heavy...
>>>
>>>  The hotspots may also be in the queries you are running ... maybe you
>>>  can describe more how you're using Lucene?
>>>
>>>  Mike McCandless
>>>
>>>  http://blog.mikemccandless.com
>>>
>>>  On Thu, Oct 17, 2013 at 10:56 AM, Igor Shalyminov
>>>  <ishalymi...@yandex-team.ru> wrote:
>>>>   Hello!
>>>>
>>>>   I've tried two approaches: 1) RAMDirectory, 2) MMapDirectory + tmpfs. 
>>>> Both perform the same for me (equally badly :( ).
>>>>   Thus, I think my problem is not disk access (although I always see 
>>>> getPayload() at the top in VisualVM).
>>>>   So, maybe the hard part of the postings traversal is decompression?
>>>>   Are there Lucene codecs which use light postings compression (or none 
>>>> at all)?
>>>>
>>>>   And, getting back to in-memory index topic, is lucene.codecs.memory 
>>>> somewhat similar to RAMDirectory?
>>>>
>>>>   --
>>>>   Best Regards,
>>>>   Igor
>>>>
>>>>   10.10.2013, 03:01, "Vitaly Funstein" <vfunst...@gmail.com>:
>>>>>   I don't think you want to load indexes of this size into a RAMDirectory.
>>>>>   The reasons have been listed multiple times here... in short, just use
>>>>>   MMapDirectory.
>>>>>
>>>>>   On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
>>>>>   <ishalymi...@yandex-team.ru>wrote:
>>>>>>    Hello!
>>>>>>
>>>>>>    I need to perform an experiment of loading the entire index in RAM and
>>>>>>    seeing how the search performance changes.
>>>>>>    My index has TermVectors with payload and position info, 
>>>>>> StoredFields, and
>>>>>>    DocValues. It takes ~30GB on disk (the server has 48).
>>>>>>
>>>>>>    _indexDirectoryReader = DirectoryReader.open(new RAMDirectory(
>>>>>>    FSDirectory.open(new File(_indexDirectory)), IOContext.DEFAULT));
>>>>>>
>>>>>>    Is the line above the only thing I have to do to complete my goal?
>>>>>>
>>>>>>    And also:
>>>>>>    - will all the data be loaded into RAM right after opening, or during
>>>>>>    the reading stage?
>>>>>>    - will the index data be stored in RAM as it is on disk, or will it be
>>>>>>    uncompressed first?
>>>>>>
>>>>>>    --
>>>>>>    Best Regards,
>>>>>>    Igor
>>>>>>
>>>>>>    ---------------------------------------------------------------------
>>>>>>    To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>    For additional commands, e-mail: java-user-h...@lucene.apache.org
>

