Re: Under the hood of SpanQueries

Karsten F. Thu, 11 Apr 2013 12:38:22 -0700

Hi Igor,
About your performance problem with SpanQueries and Payloads:
Try to filter with the corresponding BooleanQuery and use a profiler.

You have an IO bottleneck because position and payload information is read for every document.

It might help to first filter out the "obvious" non-hits.

Documents that do not contain all the search terms of the SpanQuery are "obviously" no hits. So we don't need the term positions for documents that fail to match all search terms even before position information is considered. But SpanQueries read this information even for these "obvious" non-hits.

SpanQueries, like PhraseQueries, are implicit must-BooleanQueries.

But SpanQueries directly read the position information of each term for each document

(PhraseQueries first check that all terms belong to the document).

So it could help if you make the implicit BooleanQuery explicit. First collect the hits of the BooleanQuery and then search with the SpanQuery only inside this collection (use DocIdSet as Filter).
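
To make the idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class TwoPhaseSpanSketch, its in-memory postings map and the positionReads counter are all made up for illustration, and the counter only stands in for the expensive position/payload IO. Phase 1 intersects doc IDs only (the explicit BooleanQuery), phase 2 looks at positions only for the surviving candidates (the SpanQuery inside the filter).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index, only meant to illustrate the two-phase idea.
class TwoPhaseSpanSketch {
    // term -> (docId -> positions of the term in that document)
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();

    // Counts how often a position list is fetched (stand-in for position/payload IO).
    int positionReads = 0;

    void addDocument(int docId, String... tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Phase 1: the explicit "all terms must occur" check. Uses doc IDs only.
    TreeSet<Integer> docsWithAllTerms(String first, String second) {
        TreeSet<Integer> candidates = new TreeSet<>(docs(first).keySet());
        candidates.retainAll(docs(second).keySet());
        return candidates;
    }

    // Phase 2: ordered proximity check, run only on the candidate documents.
    List<Integer> nearMatches(Iterable<Integer> candidates, String first, String second, int slop) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : candidates) {
            List<Integer> a = positions(first, docId);
            List<Integer> b = positions(second, docId);
            if (withinSlop(a, b, slop)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    private TreeMap<Integer, List<Integer>> docs(String term) {
        return postings.getOrDefault(term, new TreeMap<>());
    }

    private List<Integer> positions(String term, int docId) {
        positionReads++;
        return docs(term).getOrDefault(docId, new ArrayList<>());
    }

    // True if some occurrence of the second term follows the first
    // with at most `slop` tokens in between.
    private static boolean withinSlop(List<Integer> a, List<Integer> b, int slop) {
        for (int pa : a) {
            for (int pb : b) {
                int gap = pb - pa - 1;
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Usage would be: call docsWithAllTerms(...) first and pass its result to nearMatches(...). Documents that lack one of the terms never reach phase 2, so their positions are never read. In Lucene 4.x I would expect the same wiring to be a BooleanQuery with MUST clauses wrapped in a QueryWrapperFilter and passed to IndexSearcher.search(Query, Filter, int), but please check that against the javadoc of your version.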

If this does not help, use a profiler and ask again ;-)
Best regards,
Karsten

P.S. In context: http://lucene.472066.n3.nabble.com/Under-the-hood-of-SpanQueries-td4053638.html


On 04/03/2013 11:55 PM, Igor Shalyminov wrote:

Hi all!

I have a ~20GB index of documents that have words with several attributes 
associated with them, e.g.:

WORD: word_1 word_2 ... word_n
POS:    pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 ... lemma_n_1:lemma_n_2

Field tokens separated by ':' are ambiguous, i.e. they correspond to the same 
position in the document.
An important detail of ambiguous word attributes is that, e.g., pos1_1 corresponds 
only to lemma1_1, not to lemma1_2 or 1_3, so one must not match word_1 when 
searching for pos1_1 & lemma1_3 at the same position.

I handle the positions of ambiguous tokens with the standard positionIncrement = 0, 
and the correspondence between attribute numbers with token payloads. Say, lemma1_1 
has payload = 1, lemma1_2 has 2; pos1_1 has 1, pos1_2 has 2, and so on. When searching 
for token attributes at the same position, I use a payload filter that checks whether 
the payloads of all matched tokens are the same.

And that's it: SpanNearQueries run super slow on that index (tens of seconds, 
and the majority of indexed documents match a common query).
I don't know actually how SpanQueries work in-depth, but is there some 
inefficiency in them by design? Or is payload retrieval so expensive?
I'm just wondering if I'm missing something obvious that slows down the entire 
search.



