Re: "Starts with" query?

Erik Hatcher Fri, 06 Jan 2006 04:11:36 -0800


On Jan 6, 2006, at 7:00 AM, Erik Hatcher wrote:

I notice that if I have a title "auto update", then the phrasequery trick works if it searches on
        title:"0start0 auto*"

but does not find any matches for

        title:"0start0 aut*"

I'm a bit stuck.
PhraseQuery does not handle wildcards. Unfortunately, this is a common misunderstanding.
The MultiPhraseQuery could do this, provided you expand "aut*" into all the matching terms yourself. But here is an alternative using the new SpanRegexQuery (in contrib/regex):

    // Build a small in-memory index with two documents.
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("field", "auto update", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    doc = new Document();
    doc.add(new Field("field", "first auto update", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    // Only the document whose first term matches the regex should be found.
    IndexSearcher searcher = new IndexSearcher(directory);
    SpanRegexQuery srq = new SpanRegexQuery(new Term("field", "aut.*"));
    SpanFirstQuery sfq = new SpanFirstQuery(srq, 1);
    Hits hits = searcher.search(sfq);
    assertEquals(1, hits.length());

Notice that the query is "aut.*", not "aut*", so that it is a valid regular expression for what you want. In my current project, my custom query parser handles * and ? like WildcardQuery, but under the covers I simply convert that into a regex by replacing ? with . and * with .*
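
A minimal sketch of that wildcard-to-regex conversion, in plain Java with no Lucene dependency. The class and method names (WildcardToRegex, toRegex) are made up for illustration and are not from my parser. Escaping the other regex metacharacters is an extra step assumed here so that a literal "." in the input is not treated as "any character".

```java
import java.util.regex.Pattern;

public class WildcardToRegex {
    // Characters that are special in java.util.regex and are escaped here
    // (an assumption of this sketch, beyond the plain ?/* replacement).
    private static final String META = "\\.[]{}()+^$|";

    /** Converts a wildcard pattern using * and ? into a java.util.regex expression. */
    public static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                regex.append(".*");              // * matches any run of characters
            } else if (c == '?') {
                regex.append('.');               // ? matches exactly one character
            } else if (META.indexOf(c) >= 0) {
                regex.append('\\').append(c);    // keep other metacharacters literal
            } else {
                regex.append(c);
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("aut*");
        System.out.println(regex);                             // aut.*
        System.out.println(Pattern.matches(regex, "auto"));    // true
        System.out.println(Pattern.matches(regex, "update"));  // false
    }
}
```

The resulting string is what would be handed to something like new Term("field", regex) for the SpanRegexQuery.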

Let me add a major caveat, especially given that Paul's index is large. (Span)RegexQuery by default, currently, scans through *every* term in the index. This is due to the complexity in determining the prefix of the regex. While it is obvious that "aut.*" should only scan through terms starting with "aut", it gets more complicated with expressions like "a?uto" because the "a" is optional. There is a Jakarta Regexp implementation in contrib/regex also and it is capable of determining the static prefix to reduce term enumeration, but I suspect java.util.regex is much faster than Jakarta Regexp. I'm using, in my project, a blending of the two, letting Jakarta Regexp determine the prefix but using java.util.regex for matching - this requires a custom, and trivial, implementation of RegexCapabilities. I didn't include that in contrib/regex because it seems a bit awkward for general consumption.
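
To make the prefix issue concrete, here is a rough, self-contained sketch. It does not use Lucene or Jakarta Regexp: a sorted set stands in for the term dictionary, and staticPrefix is a deliberately conservative, hypothetical helper (it gives up and returns "" whenever it is unsure). Matching is done with java.util.regex against the whole term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class PrefixScan {
    private static final String META = "\\.[]{}()*+?^$|";

    /** Returns a literal prefix every match must start with, or "" if unsure. */
    public static String staticPrefix(String regex) {
        if (regex.indexOf('|') >= 0) {
            return "";                       // alternation: no single safe prefix
        }
        int i = 0;
        while (i < regex.length() && META.indexOf(regex.charAt(i)) < 0) {
            i++;                             // consume leading literal characters
        }
        // A following ?, * or {..} may make the last literal optional: drop it.
        if (i > 0 && i < regex.length() && "*?{".indexOf(regex.charAt(i)) >= 0) {
            i--;
        }
        return regex.substring(0, i);
    }

    /** Enumerates only terms sharing the static prefix, then applies the regex. */
    public static List<String> matchingTerms(SortedSet<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        String prefix = staticPrefix(regex);
        List<String> hits = new ArrayList<String>();
        for (String term : terms.tailSet(prefix)) {   // seek to first term >= prefix
            if (!term.startsWith(prefix)) {
                break;                                // left the prefix range, stop
            }
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<String>(
                Arrays.asList("auto", "automatic", "first", "update", "uto"));
        System.out.println(staticPrefix("aut.*"));           // aut
        System.out.println(matchingTerms(terms, "aut.*"));   // [auto, automatic]
        System.out.println(staticPrefix("a?uto").isEmpty()); // true: full scan needed
        System.out.println(matchingTerms(terms, "a?uto"));   // [auto, uto]
    }
}
```

With an empty prefix the loop degenerates into the full term scan described above, which is exactly the expensive case.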

Anyway, caveat emptor for term enumeration with (Span)RegexQuery! Also, doing term rotation at indexing and search time can greatly reduce term enumeration even with leading wildcards - but I'll leave that as an exercise for the reader for now :)
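
For anyone who wants a starting point on that exercise, one possible reading of "term rotation" is a permuterm-style index: store every rotation of each term plus an end marker, then rotate the query so the wildcard lands at the end and becomes a prefix lookup. The sketch below is plain Java with a TreeMap standing in for the term dictionary. All names are hypothetical, it handles only a single * per pattern, and it assumes terms never contain the '$' marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class TermRotation {
    private static final char END = '$';   // end-of-term marker (assumed unused in terms)

    // Maps each rotated form back to the original term.
    private final TreeMap<String, String> rotated = new TreeMap<String, String>();

    /** All rotations of term + END, e.g. "ab" -> "ab$", "b$a", "$ab". */
    public static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    /** Rotates X*Y into the prefix Y$X so the wildcard ends up at the tail. */
    public static String rotatedPrefix(String wildcard) {
        int star = wildcard.indexOf('*');
        if (star < 0 || star != wildcard.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one *");
        }
        return wildcard.substring(star + 1) + END + wildcard.substring(0, star);
    }

    public void add(String term) {
        for (String r : rotations(term)) {
            rotated.put(r, term);
        }
    }

    /** Prefix scan over the rotated forms; returns the original terms that match. */
    public SortedSet<String> search(String wildcard) {
        String prefix = rotatedPrefix(wildcard);
        SortedSet<String> hits = new TreeSet<String>();
        for (Map.Entry<String, String> e : rotated.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;                      // past the prefix range
            }
            hits.add(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        TermRotation index = new TermRotation();
        index.add("auto");
        index.add("update");
        index.add("date");
        System.out.println(index.search("*date"));  // [date, update]
        System.out.println(index.search("aut*"));   // [auto]
    }
}
```

The trade-off is index size: a term of length n contributes n + 1 rotated entries, in exchange for turning a leading-wildcard scan into a prefix seek.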


        Erik

