RE: expensive post filtering of a query's result

Uwe Schindler Mon, 25 Nov 2013 13:54:56 -0800

Hi,

Lucene Filters are always executed before on the full index. This is done 
inside getDocIdSet(), which is similar to scorer() in Querys. Most filters 
return a bitset in this method, so they calculate the whole bitset on the full 
index - this is what your filter is doing. The strategy only applies to the 
consuming of the DocIdSetIterator and how it is interleaved with the query's 
scorer (which is also a DocIdSetIterator). Bitset based filters are already 
built, so there is no real speed difference. For leap-frog there is room to 
improve, as the approach advances the iterators until they stop on the same 
document. The only chance to do some type of "postfiltering" is using a 
leap-frog approach and implementing DocIdSetIterator.advance() in an efficient 
way. For FixedBitSet/OpenBitSet there is nothing you can do, as the matching 
bits were calculated before. FYI, while calling getDocIdSet(), Lucene only 
provides acceptDocs, if there are deleted documents in the index or another 
filter was executed before.


To do post-filtering (e.g., Solr allows this), you have to do this in the 
result Collector implementation. The collector is used for result collection 
and gets every document id that matches the query+ filters. In collector's 
collect(int docId) you can do the expensive post filtering, as collect() is 
only called for matching documents. A way to do this is to write a wrapper 
filter around the final collector and only delegating those collect() calls 
that match your post-filter.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Andreas Brandl [mailto:m...@3.141592654.de]
> Sent: Monday, November 25, 2013 9:03 PM
> To: java-user@lucene.apache.org
> Subject: expensive post filtering of a query's result
> 
> Hi,
> 
> I have a Query that is fast and cheap to answer compared to a Filter
> implementation that is quite expensive (* for a background see below).
> 
> I was under the impression that when combining Query and Filter, lucene is
> able to calculate matches based on the query and *for these matches*
> applies the Filter. Actually, the Filter touches every single document in the
> index.
> 
> My questions is: How do I apply an efficient 'post-filtering' step to a 
> query's
> result without touching all documents again?
> 
> I have tried like so:
> 
> <snip>
> Filter expensiveFilter = ... // my filter implementation; Query booleanQuery
> = ... // a BooleanQuery that is comparably cheap to answer Query query =
> new FilteredQuery(booleanQuery, expensiveFilter,
> FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
> isearcher.search(query, null, Integer.MAX_VALUE); </snip>
> 
> I have also verified that the FilteredQuery.QUERY_FIRST_FILTER_STRATEGY
> strategy is actually used (i.e. no fallback to LEAP_FROG) like mentioned in 
> the
> docs:
> 
>   /**
>    * A filter strategy that advances the Query or rather its {@link Scorer} 
> first
> and consults the
>    * filter {@link DocIdSet} for each matched document.
>    * <p>
>    * Note: this strategy requires a {@link DocIdSet#bits()} to return a 
> non-null
> value. Otherwise
>    * this strategy falls back to {@link
> FilteredQuery#LEAP_FROG_QUERY_FIRST_STRATEGY}
>    * </p>
>    * <p>
>    * Use this strategy if the filter computation is more expensive than
> document
>    * scoring or if the filter has a linear running time to compute the next
>    * matching doc like exact geo distances.
>    * </p>
>    */
> 
> For me, it reads like this strategy fits exactly my use case but there is 
> clearly
> something I'm missing here, so any help/comments appreciated a lot.
> 
> Please see my Filter implementation attached if that is of interest (something
> terribly wrong there?). I'm not getting any acceptDocs, so acceptDocs is
> always null (which I thought was the way the query result gets propagated to
> the Filter).
> 
> * The background is:
> 
> I'm implementing a trigram index with lucene for regex search based on [1]
> which I'm going to evaluate against other regex search solutions (including
> Lucene's AutomatonQuery/RegexpQuery).
> 
> The lucene part boils down to having a BooleanQuery (containing trigrams
> like e.g. "OR(AND(hel,ell,llo), AND(wor,orl,rld))") which produces a candidate
> set, i.e. a superset of all actually matching documents. The last step is to
> verify each candidate, i.e. really match the document's content against the
> regex pattern. Obviously the goal is to reduce the candidate set as much as
> possible (via the BooleanQuery) and do the more expensive regex matching
> on as little documents as possible.
> 
> I'm on lucene 4.6.
> 
> Thank you,
> 
> Regards
> Andreas
> 
> [1] http://swtch.com/~rsc/regexp/regexp4.html


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

RE: expensive post filtering of a query's result

Reply via email to