Hi Igor,
About your performance problem with SpanQueries and payloads:
try filtering with the corresponding BooleanQuery first, and use a profiler.
You probably have an I/O bottleneck, because position and payload
information is read for every document.
It might help to first filter out the "obvious" non-hits.
Documents that do not contain all the search terms of the SpanQuery
are obviously non-hits.
So we don't need the term positions for these documents: they fail to
match even without the position information. But SpanQueries read
this information even for these "obvious" non-hits.
SpanQueries, like PhraseQueries, are implicitly BooleanQueries in which
every term is required (MUST).
But SpanQueries read the position information of each term directly for
each document
(PhraseQueries first check that all terms occur in the document).
So it could help to make the implicit BooleanQuery explicit: first
collect the hits of the BooleanQuery, then run the SpanQuery
only inside this collection (use the DocIdSet as a Filter).
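
To make the idea concrete, here is a toy model in plain Java. It is only a sketch and does not use the Lucene API: the class TwoPhaseSpanSketch, its in-memory postings map and the positionReads counter are all invented for illustration. Phase 1 intersects doc ids only (the explicit must-BooleanQuery); phase 2 reads positions only for the surviving candidates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy model, not Lucene code.
class TwoPhaseSpanSketch {
    // Toy postings: term -> (docId -> positions). A stand-in for the real index.
    private final Map<String, Map<Integer, int[]>> postings = new HashMap<>();
    // Counts how often position data is fetched (the expensive part).
    int positionReads = 0;

    void add(String term, int docId, int... positions) {
        postings.computeIfAbsent(term, t -> new HashMap<>()).put(docId, positions);
    }

    // Phase 1: the explicit "must" BooleanQuery. Uses doc ids only, no positions.
    TreeSet<Integer> docsWithAllTerms(String... terms) {
        TreeSet<Integer> candidates = null;
        for (String term : terms) {
            Map<Integer, int[]> docs = postings.getOrDefault(term, new HashMap<>());
            if (candidates == null) {
                candidates = new TreeSet<>(docs.keySet());
            } else {
                candidates.retainAll(docs.keySet());
            }
        }
        return candidates == null ? new TreeSet<>() : candidates;
    }

    // Reads the positions of one term in one document and counts the access.
    private int[] readPositions(String term, int docId) {
        positionReads++;
        return postings.get(term).get(docId);
    }

    // Phase 2: ordered proximity check, run only on the phase-1 candidates.
    List<Integer> searchNear(String first, String second, int maxDistance) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : docsWithAllTerms(first, second)) {
            int[] a = readPositions(first, docId);
            int[] b = readPositions(second, docId);
            if (near(a, b, maxDistance)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    // True if some position in b follows a position in a within maxDistance.
    private static boolean near(int[] a, int[] b, int maxDistance) {
        for (int pa : a) {
            for (int pb : b) {
                if (pb > pa && pb - pa <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TwoPhaseSpanSketch index = new TwoPhaseSpanSketch();
        index.add("quick", 1, 0);
        index.add("fox", 1, 2);    // doc 1: both terms, close together
        index.add("quick", 2, 0);  // doc 2: "quick" only
        index.add("fox", 3, 5);    // doc 3: "fox" only
        index.add("quick", 4, 0);
        index.add("fox", 4, 10);   // doc 4: both terms, far apart
        System.out.println(index.searchNear("quick", "fox", 2)); // prints [1]
        System.out.println(index.positionReads);                 // prints 4
    }
}
```

In this toy data, positions are never read for docs 2 and 3. With Lucene 4.x the corresponding step would presumably be a BooleanQuery of MUST TermQuery clauses wrapped in a QueryWrapperFilter and passed to IndexSearcher.search(Query, Filter, int) together with the SpanQuery; whether that really saves the position reads on your index is something only the profiler can tell.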
If this does not help, use a profiler and ask again ;-)
Best regards,
Karsten
P.S. In context:
http://lucene.472066.n3.nabble.com/Under-the-hood-of-SpanQueries-td4053638.html
On 04/03/2013 11:55 PM, Igor Shalyminov wrote:
Hi all!
I have a ~20GB index of documents that have words with several attributes
associated with them, e.g.:
WORD: word_1 word_2 ... word_n
POS: pos1_1:pos1_2:pos1_3 pos2 ... pos_n_1:pos_n_2
LEMMA: lemma1_1:lemma1_2:lemma1_3 lemma2 lemma_n_1:lemma_n_2
Field tokens separated by ':' are ambiguous, i.e. they correspond to the same
position in the document.
An important detail of ambiguous word attributes is that, e.g., pos1_1 corresponds
only to lemma1_1, not to lemma1_2 or 1_3, so one must not match word_1 when
searching for pos1_1 & lemma1_3 at the same position.
I handle the position of ambiguous tokens with the standard positionIncrement = 0,
and the correspondence of attribute numbers with token payloads. Say, lemma1_1 has
payload = 1 and lemma1_2 has 2; pos1_1 has 1, pos1_2 has 2, and so on. When searching
for token attributes at the same position, I use a payload filter that checks that the
payloads of all matched tokens are the same.
And that's it: SpanNearQueries run super slow on that index (tens of seconds,
and the majority of indexed documents match a common query).
I don't actually know how SpanQueries work in depth, but is there some
inefficiency in them by design? Or is payload retrieval so expensive?
I'm just wondering if I'm missing something obvious that slows down the entire
search.