Hi,
I have a Query that is fast and cheap to answer, and a Filter implementation
that is quite expensive by comparison (* for background, see below).
I was under the impression that when combining a Query and a Filter, Lucene
calculates the matches based on the query and applies the Filter only *for
these matches*. In practice, however, the Filter touches every single document
in the index.
My question is: how do I apply an efficient 'post-filtering' step to a query's
result without touching all documents again?
This is what I have tried:
<snip>
Filter expensiveFilter = ... // my filter implementation;
Query booleanQuery = ... // a BooleanQuery that is comparably cheap to answer
Query query = new FilteredQuery(booleanQuery, expensiveFilter,
FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
isearcher.search(query, null, Integer.MAX_VALUE);
</snip>
I have also verified that the FilteredQuery.QUERY_FIRST_FILTER_STRATEGY
strategy is actually used (i.e. there is no fallback to
LEAP_FROG_QUERY_FIRST_STRATEGY), as described in the docs:
/**
 * A filter strategy that advances the Query or rather its {@link Scorer} first and consults the
 * filter {@link DocIdSet} for each matched document.
 * <p>
 * Note: this strategy requires a {@link DocIdSet#bits()} to return a non-null value. Otherwise
 * this strategy falls back to {@link FilteredQuery#LEAP_FROG_QUERY_FIRST_STRATEGY}
 * </p>
 * <p>
 * Use this strategy if the filter computation is more expensive than document
 * scoring or if the filter has a linear running time to compute the next
 * matching doc like exact geo distances.
 * </p>
 */
To me, this reads as if the strategy fits my use case exactly, but there is
clearly something I'm missing here, so any help or comments would be much
appreciated.
My Filter implementation is attached below in case it is of interest (is
something terribly wrong there?). One thing I noticed: acceptDocs is always
null in getDocIdSet(), although I thought acceptDocs was the way the query
result gets propagated to the Filter.
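To make my expectation concrete, here is a minimal plain-Java sketch of what I
thought "query first, then filter" would mean. It does not use any Lucene
classes and is not how Lucene implements the strategy; DocBits,
QueryFirstSketch, lazyRegexBits and postFilter are names made up for this
illustration, with DocBits only standing in for the idea of random-access bits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for random-access filter bits (not a Lucene type).
interface DocBits {
    boolean get(int docID);
}

class QueryFirstSketch {
    // Counts how often the expensive check runs; for illustration only.
    static int regexCalls = 0;

    // Lazy bits: the regex is evaluated only when a docID is actually asked for.
    static DocBits lazyRegexBits(final String[] contents, final Pattern pattern) {
        return new DocBits() {
            @Override
            public boolean get(int docID) {
                regexCalls++;
                return pattern.matcher(contents[docID]).find();
            }
        };
    }

    // Walk the (cheap) query matches and consult the filter once per match.
    static List<Integer> postFilter(int[] queryMatches, DocBits filter) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int docID : queryMatches) {
            if (filter.get(docID)) {
                hits.add(docID);
            }
        }
        return hits;
    }
}
```

With five documents and a candidate set of two docIDs, I would expect the
regex to run exactly twice, no matter how large the index is. That is the
behaviour I am after.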
* The background is:
I'm implementing a trigram index with Lucene for regex search based on [1],
which I'm going to evaluate against other regex search solutions (including
Lucene's AutomatonQuery/RegexpQuery).
The Lucene part boils down to a BooleanQuery over trigrams (e.g.
"OR(AND(hel,ell,llo), AND(wor,orl,rld))") which produces a candidate set,
i.e. a superset of all actually matching documents. The last step is to verify
each candidate, i.e. to really match the document's content against the regex
pattern. Obviously the goal is to reduce the candidate set as much as possible
(via the BooleanQuery) and to run the more expensive regex matching on as few
documents as possible.
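As a small illustration of where the trigrams in the example above come from,
here is a plain-Java sketch that splits a literal into its 3-character
substrings and renders one AND clause in the notation used above. The class
and method names (TrigramSketch, trigrams, andClause) are made up for this
mail; the real query planning in [1] works on the whole regex, not just on a
single literal.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class TrigramSketch {
    // All distinct 3-character substrings of a literal, in order of first occurrence.
    static Set<String> trigrams(String literal) {
        Set<String> result = new LinkedHashSet<String>();
        for (int i = 0; i + 3 <= literal.length(); i++) {
            result.add(literal.substring(i, i + 3));
        }
        return result;
    }

    // Renders the conjunction in the notation used above, e.g. AND(hel,ell,llo).
    static String andClause(String literal) {
        StringBuilder sb = new StringBuilder("AND(");
        String separator = "";
        for (String trigram : trigrams(literal)) {
            sb.append(separator).append(trigram);
            separator = ",";
        }
        return sb.append(')').toString();
    }
}
```

For example, andClause("hello") gives "AND(hel,ell,llo)". A literal shorter
than three characters yields no trigrams, so it cannot narrow the candidate
set at all.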
I'm on Lucene 4.6.
Thank you,
Regards
Andreas
[1] http://swtch.com/~rsc/regexp/regexp4.html
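import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkNotNull;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

/** Accepts the documents whose stored field content matches a regex. */
public class PatternFilter extends Filter {
    private final Pattern pattern;
    private final String field;

    public PatternFilter(Pattern regex, String field) {
        // Null checks come first so that a null field fails with a clear error.
        this.pattern = checkNotNull(regex);
        this.field = checkNotNull(field);
        checkArgument(!field.isEmpty(), "field must not be empty");
    }

    public PatternFilter(String regex, String field) {
        this(Pattern.compile(regex), field);
    }

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
        return correctBits(context.reader(), acceptDocs);
    }

    // Eagerly evaluates the regex for every live, accepted document in the segment.
    private FixedBitSet correctBits(AtomicReader reader, Bits acceptDocs) throws IOException {
        // All bits start out unset, i.e. every document is assumed NOT to match.
        FixedBitSet bits = new FixedBitSet(reader.maxDoc());
        Bits liveDocs = reader.getLiveDocs();
        Set<String> fieldsToLoad = new HashSet<String>();
        fieldsToLoad.add(field);
        for (int docID = 0; docID < reader.maxDoc(); docID++) {
            if (liveDocs != null && !liveDocs.get(docID)) {
                // document has been deleted
                continue;
            }
            if (acceptDocs != null && !acceptDocs.get(docID)) {
                continue;
            }
            // expensive? loads the stored field from disk
            Document document = reader.document(docID, fieldsToLoad);
            String content = document.get(field);
            if (content == null) {
                // field is not present
                continue;
            }
            // expensive! runs the regex against the full field content
            if (pattern.matcher(content).find()) {
                bits.set(docID);
            }
        }
        return bits;
    }
}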