expensive post filtering of a query's result

Andreas Brandl Mon, 25 Nov 2013 12:04:12 -0800

Hi,

I have a Query that is fast and cheap to answer compared to a Filter 
implementation that is quite expensive (* for a background see below).


I was under the impression that when combining a Query and a Filter, Lucene 
calculates matches based on the query and applies the Filter only *to these 
matches*. In practice, however, the Filter touches every single document in 
the index.

My question is: How do I apply an efficient 'post-filtering' step to a query's 
result without touching all documents again?
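
To make the expectation concrete, here is a minimal plain-Java sketch of the 
post-filtering behaviour described above. It is only a model, not Lucene code: 
PostFilterSketch, CountingRegexFilter and postFilter are hypothetical names, 
and IntPredicate merely stands in for a per-document, random-access check. The 
point is that the expensive check runs once per query match instead of once 
per document in the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.regex.Pattern;

final class PostFilterSketch {

  /** Hypothetical stand-in for the expensive check; counts how often it runs. */
  static final class CountingRegexFilter implements IntPredicate {
    private final List<String> docs;
    private final Pattern pattern;
    int calls = 0;

    CountingRegexFilter(List<String> docs, Pattern pattern) {
      this.docs = docs;
      this.pattern = pattern;
    }

    @Override
    public boolean test(int docID) {
      calls++; // one expensive regex evaluation per consulted document
      return pattern.matcher(docs.get(docID)).find();
    }
  }

  /** Consults the filter only for the doc IDs the (cheap) query already matched. */
  static List<Integer> postFilter(int[] queryMatches, IntPredicate filter) {
    List<Integer> hits = new ArrayList<>();
    for (int docID : queryMatches) {
      if (filter.test(docID)) {
        hits.add(docID);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList("foo", "hello world", "bar", "help", "baz");
    CountingRegexFilter filter = new CountingRegexFilter(docs, Pattern.compile("hel+o"));
    int[] queryMatches = {1, 3}; // assumed result of the cheap query
    System.out.println(postFilter(queryMatches, filter)); // [1]
    System.out.println(filter.calls); // 2, not docs.size()
  }
}
```

In this model the filter is never asked about documents 0, 2 and 4, which is 
the behaviour I would like to get from the Query/Filter combination.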

I have tried like so:

<snip>
Filter expensiveFilter = ... // my filter implementation;
Query booleanQuery = ... // a BooleanQuery that is comparably cheap to answer
Query query = new FilteredQuery(booleanQuery, expensiveFilter,
    FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
isearcher.search(query, null, Integer.MAX_VALUE);
</snip>

I have also verified that the FilteredQuery.QUERY_FIRST_FILTER_STRATEGY 
strategy is actually used (i.e. no fallback to LEAP_FROG), as mentioned in the 
docs:

  /**
   * A filter strategy that advances the Query or rather its {@link Scorer} first and consults the
   * filter {@link DocIdSet} for each matched document.
   * <p>
   * Note: this strategy requires a {@link DocIdSet#bits()} to return a non-null value. Otherwise
   * this strategy falls back to {@link FilteredQuery#LEAP_FROG_QUERY_FIRST_STRATEGY}
   * </p>
   * <p>
   * Use this strategy if the filter computation is more expensive than document
   * scoring or if the filter has a linear running time to compute the next
   * matching doc like exact geo distances.
   * </p>
   */

To me, this strategy reads as if it fits my use case exactly, but I am clearly 
missing something here, so any help or comments would be much appreciated.

Please see my Filter implementation attached, in case it is of interest (is 
something terribly wrong there?). I never receive any acceptDocs: the argument 
is always null, although I thought that was how the query result gets 
propagated to the Filter.

* The background is:

I'm implementing a trigram index with Lucene for regex search based on [1] 
which I'm going to evaluate against other regex search solutions (including 
Lucene's AutomatonQuery/RegexpQuery).

The Lucene part boils down to having a BooleanQuery (containing trigrams, 
e.g. "OR(AND(hel,ell,llo), AND(wor,orl,rld))") which produces a candidate set, 
i.e. a superset of all actually matching documents. The last step is to verify 
each candidate, i.e. to actually match the document's content against the regex 
pattern. Obviously the goal is to reduce the candidate set as much as possible 
(via the BooleanQuery) and to do the more expensive regex matching on as few 
documents as possible.
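
The two-step scheme can be sketched in plain Java without any Lucene classes. 
This is only an illustration under simplifying assumptions: TrigramSketch and 
its methods are hypothetical names, the literals "hello" and "world" are 
assumed to have been extracted from the regex already, and the trigram check 
is done by scanning each document, where a real index would answer it from 
posting lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class TrigramSketch {

  /** Returns the distinct trigrams of a string, e.g. "hello" -> [hel, ell, llo]. */
  static Set<String> trigrams(String text) {
    Set<String> result = new LinkedHashSet<>();
    for (int i = 0; i + 3 <= text.length(); i++) {
      result.add(text.substring(i, i + 3));
    }
    return result;
  }

  /**
   * Cheap step, an OR of ANDs: a document is a candidate if it contains all
   * trigrams of at least one literal. Literals shorter than three characters
   * have no trigrams and therefore accept every document.
   */
  static boolean isCandidate(String content, List<String> literals) {
    Set<String> docTrigrams = trigrams(content);
    for (String literal : literals) {
      if (docTrigrams.containsAll(trigrams(literal))) {
        return true;
      }
    }
    return false;
  }

  /** Expensive step: the regex is evaluated on candidates only. */
  static List<Integer> search(List<String> docs, List<String> literals, Pattern pattern) {
    List<Integer> hits = new ArrayList<>();
    for (int docID = 0; docID < docs.size(); docID++) {
      String content = docs.get(docID);
      if (isCandidate(content, literals) && pattern.matcher(content).find()) {
        hits.add(docID);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList("hello there", "help yellow llo", "world", "nothing");
    List<String> literals = Arrays.asList("hello", "world");
    System.out.println(search(docs, literals, Pattern.compile("hello|world"))); // [0, 2]
  }
}
```

Document 1 ("help yellow llo") shows why the verification step is needed: it 
contains the trigrams hel, ell and llo and so is a candidate, but it does not 
match the pattern.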

I'm on Lucene 4.6.

Thank you,

Regards
Andreas

[1] http://swtch.com/~rsc/regexp/regexp4.html

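import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.base.Preconditions.checkNotNull;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;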

public class PatternFilter extends Filter {

  private final Pattern pattern;
  private final String field;

  public PatternFilter(Pattern regex, String field) {
    // Null-check both arguments before inspecting the field name.
    this.pattern = checkNotNull(regex);
    this.field = checkNotNull(field);
    checkArgument(!field.isEmpty());
  }

  public PatternFilter(String regex, String field) {
    this(Pattern.compile(regex), field);
  }

  @Override
  public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
    return correctBits(context.reader(), acceptDocs);
  }

  private FixedBitSet correctBits(AtomicReader reader, Bits acceptDocs) throws IOException {
    // All bits start cleared, i.e. every document is assumed INvalid.
    FixedBitSet bits = new FixedBitSet(reader.maxDoc());

    Bits liveDocs = reader.getLiveDocs();

    Set<String> fieldsToLoad = new HashSet<String>();
    fieldsToLoad.add(field);

    for (int docID = 0; docID < reader.maxDoc(); docID++) {

      if (liveDocs != null && !(liveDocs.get(docID))) {
        // document is not alive anymore...
        continue;
      }

      if (acceptDocs != null && !(acceptDocs.get(docID))) {
        continue;
      }

      // expensive?
      Document document = reader.document(docID, fieldsToLoad);
      String content = document.get(field);

      if (content == null) {
        // field is not present
        continue;
      }

      // expensive!
      if (pattern.matcher(content).find()) {
        bits.set(docID);
      }

    }

    return bits;
  }


}
