Hi Igor,
About your performance problem with SpanQueries and Payloads:
Try filtering with the corresponding BooleanQuery first, and use a profiler.
You most likely have an I/O bottleneck, because position and payload information is read for every document.
It might help to first filter out the "obvious" non-hits.
Documents that do not contain all the search terms of the SpanQuery are "obviously" not hits, so their term positions are never needed: they can be ruled out without any position information. SpanQueries nevertheless read this information even for these "obvious" non-hits.
SpanQueries, like PhraseQueries, are implicitly BooleanQueries in which every term is a MUST clause.
But SpanQueries read the position information of each term directly, for every document
(PhraseQueries first check that all terms occur in the document).
So it could help to make the implicit BooleanQuery explicit: first collect the hits of the BooleanQuery, and then run the SpanQuery only inside this collection (use the DocIdSet as a Filter).
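As a rough illustration of the two-step idea, here is a toy in-memory model in plain Java. It is not Lucene code and does not use the Lucene API: the class TwoPhaseSpanSketch, its methods, and the sample terms are all made up for this sketch, and "directly follows" stands in for whatever positional condition your SpanQuery expresses.

```java
import java.util.*;

// Toy model only: a hypothetical in-memory index, not a Lucene structure.
class TwoPhaseSpanSketch {
    // term -> (docId -> sorted positions of the term in that document)
    static final Map<String, Map<Integer, int[]>> INDEX = new HashMap<>();

    // Positions must be passed in ascending order (binary search relies on it).
    static void add(String term, int doc, int... positions) {
        INDEX.computeIfAbsent(term, t -> new TreeMap<Integer, int[]>()).put(doc, positions);
    }

    // Step 1: the explicit "all terms are MUST" check.
    // It only intersects doc ids and never looks at positions.
    static SortedSet<Integer> docsWithAllTerms(List<String> terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            Map<Integer, int[]> postings = INDEX.get(term);
            Set<Integer> docs = (postings == null) ? Collections.<Integer>emptySet() : postings.keySet();
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return (result == null) ? new TreeSet<Integer>() : result;
    }

    // Step 2: the positional check, run only on the candidates from step 1.
    static List<Integer> adjacentInOrder(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : docsWithAllTerms(Arrays.asList(first, second))) {
            int[] a = INDEX.get(first).get(doc);
            int[] b = INDEX.get(second).get(doc);
            if (hasAdjacent(a, b)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // True if some position in b is exactly one greater than a position in a.
    static boolean hasAdjacent(int[] a, int[] b) {
        for (int p : a) {
            if (Arrays.binarySearch(b, p + 1) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        add("quick", 1, 0); add("fox", 1, 1);   // adjacent in doc 1
        add("quick", 2, 5); add("fox", 2, 9);   // both present, not adjacent
        add("quick", 3, 2);                     // "fox" missing: an "obvious" non-hit
        System.out.println(docsWithAllTerms(Arrays.asList("quick", "fox")));
        System.out.println(adjacentInOrder("quick", "fox"));
    }
}
```

In this toy data, document 3 is dropped in step 1 without its positions ever being touched, and only documents 1 and 2 reach the positional check. Whether the same saving shows up on your index is exactly what the profiler should tell you.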
If this does not help, use a profiler and ask again ;-)
Best regards,
Karsten

ps. in context: http://lucene.472066.n3.nabble.com/Under-the-hood-of-SpanQueries-td4053638.html

On 04/03/2013 11:55 PM, Igor Shalyminov wrote:
Hi all!

I have a ~20GB index of documents that have words with several attributes 
associated with them, e.g.:

WORD: word_1 word_2 ... word_n
POS:    pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 lemma_n_1:lemma_n_2

Field tokens separated by ':' are ambiguous, i.e. they correspond to the same 
position in the document.
An important detail of ambiguous word attributes is that, e.g., pos1_1 corresponds 
only to lemma1_1, not to lemma1_2 or 1_3, so one must not match word_1 when 
searching for pos1_1 & lemma1_3 at the same position.

I handle ambiguous token positions with the standard positionIncrement = 0, and 
attribute number correspondence with token payloads. Say, lemma1_1 has payload 
= 1, lemma1_2 has 2; pos1_1 has 1, pos1_2 has 2, and so on. When searching for 
token attributes at the same position I use a payload filter that checks whether 
the payloads of all matched tokens are the same.

And that's it: SpanNearQueries run super slow on that index (tens of seconds, 
and the majority of indexed documents match a common query).
I don't know actually how SpanQueries work in-depth, but is there some 
inefficiency in them by design? Or is payload retrieval so expensive?
I'm just wondering if I'm missing something obvious that slows down the entire 
search.


