Re: RegexQuery Incomplete Results

Ian Lea Mon, 11 May 2009 07:30:28 -0700

The little self-contained program below runs regex queries for a few
regexps against a few phrases for both the java.util and jakarta
regexp packages.


Output when run with lucene 2.4.1 and jakarta-regexp 1.5 is

Added Knowing yourself
Added Old clinic
Added INSIDE
Added Not INSIDE

Default 
regexcapabilities=org.apache.lucene.search.regex.javautilregexcapabilit...@0

org.apache.lucene.search.regex.javautilregexcapabilit...@0
0 hits for text:.in
2 hits for text:.*in
0 hits for text:.IN
2 hits for text:.*IN
org.apache.lucene.search.regex.jakartaregexpcapabilit...@0
2 hits for text:.in
2 hits for text:.*in
1 hits for text:.IN
2 hits for text:.*IN

Hope that helps.

--
Ian.


import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.search.*;
import org.apache.lucene.search.regex.*;

public class luctest {

    public static void main(String[] _args) throws Exception {
        RAMDirectory rdir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(rdir, new StandardAnalyzer(), 
true);
        String[] docterms = { "Knowing yourself",
                              "Old clinic",
                              "INSIDE",
                              "Not INSIDE" };

        for (String s : docterms) {
            Document d = new Document();
            d.add(new Field("text",
                            s,
                            Field.Store.YES,
                            Field.Index.NOT_ANALYZED));
            writer.addDocument(d);
            System.out.printf("Added %s\n", s);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(rdir);
        String[] queries = { ".in", ".*in", ".IN", ".*IN" };
        RegexCapabilities[] rcaps = { new JavaUtilRegexCapabilities(),
                                      new JakartaRegexpCapabilities() };
        RegexQuery qx = new RegexQuery(new Term("x", "x"));
        System.out.printf("\nDefault RegexCapabilities=%s\n\n",
                          qx.getRegexImplementation());
        for (RegexCapabilities rcap : rcaps) {
            System.out.println(rcap);
            for (String s : queries) {
                Term t = new Term("text", s);
                RegexQuery q = new RegexQuery(t);
                q.setRegexImplementation(rcap);
                Hits h = searcher.search(q);
                System.out.printf("%s hits for %s\n",
                                  h.length(),
                                  q.toString());
            }
        }
    }
}


On Mon, May 11, 2009 at 1:39 PM, Huntsman84 <tpgarci...@gmail.com> wrote:
>
> The RegexQuery class uses that package, and for that reason the expression
> matches.
>
> If my records contained only one word each, this code would work, but I need
> to apply that regular expression to a phrase...
>
>
> Ian Lea wrote:
>>
>> The default regex package is java.util.regex and I can't see anywhere
>> that you tell it to use the Jakarta regexp package.  So I don't think
>> that ".in" will match.  Also, you are storing your contents field as
>> NOT_ANALYZED so you will need to be wary of case sensitivity.  Maybe
>> this is what you want, but maybe not.
>>
>>
>> --
>> Ian.
>>
>>
>> On Mon, May 11, 2009 at 9:00 AM, Huntsman84 <tpgarci...@gmail.com> wrote:
>>>
>>> This is the code for searching:
>>>
>>> String index = "index";
>>> String field = "contents";
>>> IndexReader reader = IndexReader.open(index);
>>> Searcher searcher = new IndexSearcher(reader);
>>>
>>> System.out.println("Enter query: ");
>>> String line = ".IN.";//in jakarta regexp this is like * IN *
>>> RegexQuery rxquery = new RegexQuery(new Term(field,line));
>>> Hits hits = searcher.search(rxquery);
>>>
>>> if(hits!=null){
>>>    for(int k = 0; k<100 && k<hits.length(); k++){
>>>        if(hits.doc(k)!=null)
>>>
>>>  System.out.println(hits.doc(k).getField("contents").stringValue());
>>>    }
>>> }
>>>
>>>
>>>
>>> And this is the part of creating the index:
>>>
>>>
>>> File directory = new File("index");
>>> IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(),
>>> true,
>>>                            IndexWriter.MaxFieldLength.LIMITED);
>>> List<String> records = getRecords();//returns a list of record values
>>> from
>>> database, all of them are phrases
>>> Iterator<String> i = records.iterator();
>>> while(i.hasNext()){
>>>           Document doc = new Document();
>>>           doc.add(new Field(field, i.next(), Field.Store.YES,
>>> Field.Index.NOT_ANALYZED));
>>>        writer.addDocument(doc);
>>> }
>>> writer.optimize();
>>> writer.close();
>>>
>>>
>>>
>>> This code works as I want but just matching with the first word of the
>>> phrase. I think the problem is the index building, but I don't know how
>>> to
>>> fix it...
>>>
>>> Any ideas?
>>>
>>> Thank you so much!!
>>>
>>>
>>>
>>> Steven A Rowe wrote:
>>>>
>>>> On 5/8/2009 at 9:13 AM, Ian Lee wrote:
>>>>> I'm surprised that it matches either - don't you need ".*in" where .*
>>>>> means match any character zero or more times?  See the javadoc for
>>>>> java.util.regex.Pattern, or for Jakarta Regexp if you are using that
>>>>> package.
>>>>>
>>>>> Unless you're an expert in regexps it is probably worth playing with
>>>>> them outside your lucene code to start with e.g. with simple
>>>>> String.matches(regexp) calls.  They can take some getting used to.
>>>>> And try to avoid anything with backslashes if you can!
>>>>
>>>> The java.util.regex.Pattern implementation (the default RegexQuery
>>>> implementation) actually uses Matcher.lookingAt(), which is equivalent
>>>> to
>>>> prepending a "^" anchor to the beginning of the pattern, so if
>>>> Huntsman84
>>>> is using the default implementation, then I agree with Ian: I'm
>>>> surprised
>>>> it matches either.
>>>>
>>>> However, the Jakarta Regexp implementation uses RE.match(), which does
>>>> *not* require a beginning-of-string match.
>>>>
>>>> Hunstman84, are you using the Jakarta Regexp implementation?  If so,
>>>> then
>>>> like you, I'm surprised it's not matching both :).
>>>>
>>>> It would be useful to see some real code, including how you index your
>>>> records.
>>>>
>>>> Steve
>>>>
>>>>> On Fri, May 8, 2009 at 1:42 PM, Huntsman84 <tpgarci...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I am using RegexQuery for searching in a set of records wich are
>>>>> > phrases of several words each. My aim is to find any phrase that
>>>>> > contains the given group of letters (e.g. "in"). For that case,
>>>>> > I am building the query with the regular expression ".in.", so it
>>>>> > should return all phrases with contain "in", but the search only
>>>>> > matches with the first word of the phrase.
>>>>> >
>>>>> > For example, if my records are "Knowing yourself" and "Old
>>>>> > clinic", the correct search would return 2 matches, but it only
>>>>> > matches with "Knowing yourself".
>>>>> >
>>>>> > How could I fix this?
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23478720.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23482532.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: RegexQuery Incomplete Results

Reply via email to