Re: Lucene in-memory index

Igor Shalyminov Tue, 22 Oct 2013 06:44:33 -0700

Hello Mike!


19.10.2013, 14:54, "Michael McCandless" <luc...@mikemccandless.com>:
> On Fri, Oct 18, 2013 at 5:50 PM, Igor Shalyminov
> <ishalymi...@yandex-team.ru> wrote:
>
>>  But why is it so costly?
>
> I think because the matching is inherently complex?  But also because
> it does high-cost things like allocating new List and Set for every
> matched doc (e.g. NearSpansOrdered.shrinkToAfterShortestMatch) to hold
> all payloads it encountered within each span. Patches welcome!
>
>>  In a regular query we walk postings and match document numbers, in a 
>> SpanQuery we match position numbers (or position segments), what's the 
>> principal difference?
>>  I think it's just that #documents << #positions.
>
> Conceptually, that's right, we just need to decode "more ints" (and
> also the payloads), but then need to essentially merge-sort the
> positions of N terms, and then "coalesce" them into spans, is at heart
> rather costly.  Lots of hard-for-CPU-to-predict branches...
>
> But I suspect we could get some good speedups on span queries with a
> better implementation;
> https://issues.apache.org/jira/browse/LUCENE-2878 is [slowly]
> exploring making positions "first class" in Scorer, so you can iterate
> over position + payload for each hit.

Thanks for the link, I'll definitely dig into SpanQuery internals very soon.

>
>>  For "A,sg" and "A,pl" I use unordered SpanNearQueries with the slop=-1.
>
> I didn't even realize you could pass negative slop to span queries.
> What does that do?  Or did you mean slop=1?

I indeed use an unordered SpanNearQuery with the slop = --1 (I saw it on some 
forum, maybe here: 
http://www.gossamer-threads.com/lists/lucene/java-user/89377?do=post_view_flat#89377)
So far it works for me:)

>
>>  I wrap them into an ordered SpanNearQuery with the slop=0.
>>
>>  I see getPayload() in the profiler top. I think I can emulate payload 
>> checking with cleverly assigned position increments (and then maximum 
>> position in a document might jump up to ~10^9 - I hope it won't blow the 
>> whole index up).
>>
>>  If I remove payload matching and keep only position checking, will it speed 
>> up everything, or the positions and payloads are the same?
>
> I think it would help to avoid payloads, but I'm not sure by how much.
>  E.g., I see that NearSpansOrdered creates a new Set for every hit
> just to hold payloads, even if payloads are not going to be used.
> Really the span scorers should check Terms.hasPayloads up front ...
>
>>  My main goal is getting the precise results for a query, so proximity 
>> boosting won't help, unfortunately.
>
> OK.
>
> I wonder if you can somehow identify the spans you care about at
> indexing time, e.g. A,sg followed by N,sg and e.g. add a span into the
> index at that point; this would make searching much faster (it becomes
> a TermQuery).  For exact matching (slop=0) you can also index
> shingles.

Thanks for the clue, I think it can be a good optimization heuristic.
I actually tried a similar approach to optimize search of attributes at the 
same position.
Here's how it was supposed to work for a feature set "S,sg,nom,fem":

* the regular approach: split it into grammar atomics: "S", "sg", "nom", "fem". 
With payloads and positions assigned the right way, this would allow us to 
search for an arbitrary combination of these attributes _but_ with multiple 
postings merging.
* the experimental approach: sort the atomics lexicographically and index all 
the subsets: "S", "fem", "nom", "sg", "S,fem", "S,nom", ..., "S,fem,nom,sg". 
With the preprocessing of the user query the same way (split - sort - join) it 
would allow us to process the same queries exactly within one posting.

This technique is actually used in our current production index based on 
Yandex.Server engine.
But Yandex.Server somehow makes the index size reasonable (within the order of 
magnitude of original text size), while Lucene index blows up totally ( >10 
times original text size) and no search performance improvements appear.

>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Lucene in-memory index

Reply via email to