Re: PrefixQuery with short prefix does not match documents

Steven Schlansker Tue, 28 May 2013 17:20:48 -0700

Hi Mike,

Thank you for the pointer; that is indeed the cause here.
I added the rewrite method to preserve the field's boost on matches.
Specifically, some results have a field boost of log(popularity) and others 
have a field boost of 100 to float them to the top.


Without the rewrite method, all matches get the same score, so the ordering 
of the results is more or less arbitrary.
It seems that I cannot expect scored results for more than 
booleanMaxClauseCount terms on a prefix query, at least from my reading of 
MultiTermQuery's nested classes.
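
To make the effect of a term cap concrete, here is a toy model in plain Java. It is not Lucene code: the class and method names are made up for illustration, and it assumes the surviving terms are simply the first N in sorted order, which may not be how Lucene picks them. It only shows why a short prefix can lose a document that a longer prefix still finds.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Toy model only: this is NOT Lucene code, and the names are hypothetical.
class TopTermsToyModel {

    // Collect the indexed terms that start with the prefix, keeping at most maxTerms.
    // Assumption: the survivors are the first maxTerms in sorted order.
    static List<String> expandPrefix(TreeSet<String> indexedTerms, String prefix, int maxTerms) {
        return indexedTerms.tailSet(prefix, true).stream()
                .takeWhile(term -> term.startsWith(prefix)) // matching terms are contiguous
                .limit(maxTerms)                            // anything past the cap is dropped
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TreeSet<String> terms =
                new TreeSet<>(List.of("gable", "garden", "gate", "german", "ghost"));

        // Short prefix: five terms match but only three are kept, so "german" is lost.
        System.out.println(expandPrefix(terms, "g", 3));  // [gable, garden, gate]

        // Longer prefix: a single term matches, so nothing is dropped.
        System.out.println(expandPrefix(terms, "ge", 3)); // [german]
    }
}
```

With a cap of 10000 and a real index the numbers are larger, but the shape of the problem is the same: the shorter the prefix, the more terms compete for the fixed number of slots.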

Is there a better way to indicate "popularity" than field boosts, that might 
work with PrefixQuery?  Or am I asking for the impossible here?



The EdgeNGramFilter looks very interesting, and I suspect it is basically 
exactly what I want.  But I am going to be expected to ship this to production 
soon.  How confident are you of the quality of the current patch?  I am willing 
to deal with some level of pain (it's pretty clear that I'd have to redo the 
way that it does indexing, for example, to read from my data source instead of 
a file…) but I am going to look like a fool if it crashes all over the place :-)
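
For anyone following along, the edge n-gram idea can be sketched in plain Java. This is only an illustration of which prefixes would be indexed and when to switch over to PrefixQuery; it is not the EdgeNGramFilter implementation or the LUCENE-4845 patch, and the class name, method names, and the gram lengths 1 to 3 are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of edge n-grams; hypothetical names, not Lucene's filter.
class EdgeNGramSketch {

    // Return the leading substrings of term with lengths minGram..maxGram,
    // capped at the length of the term itself.
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    // Hypothetical cutover rule: prefixes no longer than maxGram were indexed as
    // whole terms, so they can be looked up exactly; longer ones need a prefix query.
    static boolean useExactGramLookup(String userPrefix, int maxGram) {
        return userPrefix.length() <= maxGram;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("german", 1, 3));    // [g, ge, ger]
        System.out.println(useExactGramLookup("g", 3));    // true
        System.out.println(useExactGramLookup("germ", 3)); // false
    }
}
```

Because each short prefix becomes a single indexed term, a query for "g" touches one term rather than expanding to every term that starts with "g", which is why the term cap stops being an issue for short prefixes.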

Thanks,
Steven

On May 25, 2013, at 8:44 AM, Michael McCandless <luc...@mikemccandless.com> 
wrote:

> I suspect this is because you set TopTermsScoringBooleanQueryRewrite
> method on the PrefixQuery: this will keep "only" the top 10K terms, so
> if g* matches more than 10K terms, some terms are dropped.
> 
> You may want to index short prefixes into the index instead, e.g.
> using EdgeNGramFilter, and then cutover to PrefixQuery when the prefix
> is "long enough".
> 
> This is the approach I took with the index-based suggester on
> https://issues.apache.org/jira/browse/LUCENE-4845 ...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Fri, May 24, 2013 at 7:06 PM, Steven Schlansker <ste...@likeness.com> 
> wrote:
>> Hi everyone,
>> 
>> I am building an autocomplete index.  The index contains both the names and 
>> a small set of fixed types.
>> The intention is that type matches will always come first, followed by name 
>> matches.
>> 
>> I am using a PrefixQuery to do substring matching.  Confusingly, I am 
>> finding that very short prefix
>> matches sometimes will return no results when combined with an additional 
>> filter.
>> 
>> For example, I have a document "body:german type:TYPE".  The query 
>> "+(type:TYPE) +body:ge*" matches this document.
>> The query "+(type:TYPE) +body:g*" does not.  Double confusingly, it works 
>> fine in Luke -- just not when I build the query by hand.
>> 
>> Here is how I create the document:
>> 
>> Document doc = new Document();
>> doc.add(new Field("body", "German", TextField.TYPE_STORED));
>> doc.add(new Field("type", "TYPE", StringField.TYPE_STORED));
>> 
>> Here is how I build the query:
>> 
>> // Declared as BooleanQuery (not Query) so that add() is available.
>> BooleanQuery allowedTypes = new BooleanQuery();
>> allowedTypes.add(new TermQuery(new Term("type", "TYPE")), Occur.SHOULD);
>> 
>> // Declared as PrefixQuery so that setRewriteMethod() is available.
>> PrefixQuery prefixQuery = new PrefixQuery(new Term("body", "ge"));
>> prefixQuery.setRewriteMethod(
>>     new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(10000));
>> 
>> BooleanQuery mainQuery = new BooleanQuery();
>> mainQuery.add(allowedTypes, Occur.MUST);
>> mainQuery.add(prefixQuery, Occur.MUST);
>> 
>> Am I missing something obvious?
>> 
>> Thanks,
>> Steven Schlansker
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 