The little self-contained program below runs a few regex queries against a few phrases, using both the java.util.regex and Jakarta Regexp packages.
Output when run with lucene 2.4.1 and jakarta-regexp 1.5 is

Added Knowing yourself
Added Old clinic
Added INSIDE
Added Not INSIDE

Default RegexCapabilities=org.apache.lucene.search.regex.JavaUtilRegexCapabilit...@0

org.apache.lucene.search.regex.JavaUtilRegexCapabilit...@0
0 hits for text:.in
2 hits for text:.*in
0 hits for text:.IN
2 hits for text:.*IN
org.apache.lucene.search.regex.JakartaRegexpCapabilit...@0
2 hits for text:.in
2 hits for text:.*in
1 hits for text:.IN
2 hits for text:.*IN

Hope that helps.

--
Ian.

import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.search.*;
import org.apache.lucene.search.regex.*;

public class luctest {
    public static void main(String[] _args) throws Exception {
        RAMDirectory rdir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(rdir, new StandardAnalyzer(), true);

        // Index each phrase as a single un-analyzed term, as in the original post.
        String[] docterms = { "Knowing yourself", "Old clinic", "INSIDE", "Not INSIDE" };
        for (String s : docterms) {
            Document d = new Document();
            d.add(new Field("text", s, Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(d);
            System.out.printf("Added %s\n", s);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(rdir);
        String[] queries = { ".in", ".*in", ".IN", ".*IN" };
        RegexCapabilities[] rcaps = { new JavaUtilRegexCapabilities(),
                                      new JakartaRegexpCapabilities() };

        // Show which implementation a RegexQuery uses when none is set.
        RegexQuery qx = new RegexQuery(new Term("x", "x"));
        System.out.printf("\nDefault RegexCapabilities=%s\n\n",
                          qx.getRegexImplementation());

        // Run every query against each regex implementation in turn.
        for (RegexCapabilities rcap : rcaps) {
            System.out.println(rcap);
            for (String s : queries) {
                Term t = new Term("text", s);
                RegexQuery q = new RegexQuery(t);
                q.setRegexImplementation(rcap);
                Hits h = searcher.search(q);
                System.out.printf("%s hits for %s\n", h.length(), q.toString());
            }
        }
    }
}


On Mon, May 11, 2009 at 1:39 PM, Huntsman84 <tpgarci...@gmail.com> wrote:
>
> The RegexQuery class
> uses that package, and for that reason the expression matches.
>
> If my records contained only one word each, this code would work, but I need
> to apply that regular expression to a phrase...
>
>
> Ian Lea wrote:
>>
>> The default regex package is java.util.regex and I can't see anywhere
>> that you tell it to use the Jakarta regexp package. So I don't think
>> that ".in" will match. Also, you are storing your contents field as
>> NOT_ANALYZED so you will need to be wary of case sensitivity. Maybe
>> this is what you want, but maybe not.
>>
>>
>> --
>> Ian.
>>
>>
>> On Mon, May 11, 2009 at 9:00 AM, Huntsman84 <tpgarci...@gmail.com> wrote:
>>>
>>> This is the code for searching:
>>>
>>>     String index = "index";
>>>     String field = "contents";
>>>     IndexReader reader = IndexReader.open(index);
>>>     Searcher searcher = new IndexSearcher(reader);
>>>
>>>     System.out.println("Enter query: ");
>>>     String line = ".IN."; //in jakarta regexp this is like * IN *
>>>     RegexQuery rxquery = new RegexQuery(new Term(field,line));
>>>     Hits hits = searcher.search(rxquery);
>>>
>>>     if(hits!=null){
>>>         for(int k = 0; k<100 && k<hits.length(); k++){
>>>             if(hits.doc(k)!=null)
>>>                 System.out.println(hits.doc(k).getField("contents").stringValue());
>>>         }
>>>     }
>>>
>>>
>>>
>>> And this is the part of creating the index:
>>>
>>>
>>>     File directory = new File("index");
>>>     IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(),
>>>             true, IndexWriter.MaxFieldLength.LIMITED);
>>>     //returns a list of record values from database, all of them are phrases
>>>     List<String> records = getRecords();
>>>     Iterator<String> i = records.iterator();
>>>     while(i.hasNext()){
>>>         Document doc = new Document();
>>>         doc.add(new Field(field, i.next(), Field.Store.YES,
>>>                 Field.Index.NOT_ANALYZED));
>>>         writer.addDocument(doc);
>>>     }
>>>     writer.optimize();
>>>     writer.close();
>>>
>>>
>>>
>>> This code works as I want but just matching with the first word of the
>>> phrase.
>>> I think the problem is the index building, but I don't know how to
>>> fix it...
>>>
>>> Any ideas?
>>>
>>> Thank you so much!!
>>>
>>>
>>>
>>> Steven A Rowe wrote:
>>>>
>>>> On 5/8/2009 at 9:13 AM, Ian Lea wrote:
>>>>> I'm surprised that it matches either - don't you need ".*in" where .*
>>>>> means match any character zero or more times? See the javadoc for
>>>>> java.util.regex.Pattern, or for Jakarta Regexp if you are using that
>>>>> package.
>>>>>
>>>>> Unless you're an expert in regexps it is probably worth playing with
>>>>> them outside your lucene code to start with e.g. with simple
>>>>> String.matches(regexp) calls. They can take some getting used to.
>>>>> And try to avoid anything with backslashes if you can!
>>>>
>>>> The java.util.regex.Pattern implementation (the default RegexQuery
>>>> implementation) actually uses Matcher.lookingAt(), which is equivalent to
>>>> prepending a "^" anchor to the beginning of the pattern, so if Huntsman84
>>>> is using the default implementation, then I agree with Ian: I'm surprised
>>>> it matches either.
>>>>
>>>> However, the Jakarta Regexp implementation uses RE.match(), which does
>>>> *not* require a beginning-of-string match.
>>>>
>>>> Huntsman84, are you using the Jakarta Regexp implementation? If so, then
>>>> like you, I'm surprised it's not matching both :).
>>>>
>>>> It would be useful to see some real code, including how you index your
>>>> records.
>>>>
>>>> Steve
>>>>
>>>>> On Fri, May 8, 2009 at 1:42 PM, Huntsman84 <tpgarci...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I am using RegexQuery for searching in a set of records which are
>>>>> > phrases of several words each. My aim is to find any phrase that
>>>>> > contains the given group of letters (e.g. "in").
>>>>> > For that case, I am building the query with the regular expression
>>>>> > ".in.", so it should return all phrases which contain "in", but the
>>>>> > search only matches with the first word of the phrase.
>>>>> >
>>>>> > For example, if my records are "Knowing yourself" and "Old
>>>>> > clinic", the correct search would return 2 matches, but it only
>>>>> > matches with "Knowing yourself".
>>>>> >
>>>>> > How could I fix this?
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23478720.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>
> --
> View this message in context:
> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23482532.html
>
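
Steve's point in the quoted thread, that the default java.util.regex implementation uses Matcher.lookingAt() while the Jakarta Regexp implementation does not require a match at the beginning of the string, can be tried outside Lucene with plain java.util.regex calls. The following is only a rough sketch under stated assumptions: the class name LookingAtDemo and its two helper methods are made up for illustration, and Matcher.find() is used here as a stand-in for the unanchored behaviour described for RE.match(); it is not the Jakarta library itself.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene or of the program in this thread.
public class LookingAtDemo {

    // Anchored at the start of the text only, like Matcher.lookingAt()
    // as described for the default RegexQuery implementation.
    public static boolean lookingAt(String regex, String text) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    // Unanchored: succeeds if the pattern matches anywhere in the text.
    // Assumption: used as an approximation of the Jakarta RE.match() behaviour.
    public static boolean find(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Same phrases and patterns as the test program in this thread.
        String[] phrases = { "Knowing yourself", "Old clinic", "INSIDE", "Not INSIDE" };
        String[] regexps = { ".in", ".*in", ".IN", ".*IN" };

        for (String re : regexps) {
            int anchored = 0;
            int anywhere = 0;
            for (String p : phrases) {
                if (lookingAt(re, p)) anchored++;
                if (find(re, p)) anywhere++;
            }
            System.out.printf("%-5s lookingAt=%d find=%d%n", re, anchored, anywhere);
        }
    }
}
```

If the sketch is a fair approximation, the lookingAt counts should line up with the java.util.regex hit counts in the output above and the find counts with the Jakarta ones. With the default implementation, a pattern with a leading ".*" such as ".*in" is the one that finds "in" anywhere in an un-analyzed phrase.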