Hi ba3, StandardAnalyzer breaks text into individual terms, removes most punctuation, downcases, removes stopwords, etc. Your example text becomes the following sequence of terms:
1. hello 2. world 3. searched (this,is,to,be are all in the default stopword set) 4. test ('#', like most other punctuation, is removed) ... The regular expression you gave is matched against the text of each of the individual tokens produced by the Analyzer (first "hello", then "world", etc.), so you'll never get a match. There are two routes I can think of that you could take: 1. Use KeywordAnalyzer instead of StandardAnalyzer when you index the contents field. That would place the entire text in a single field value, and would allow your regex to match the full text, rather than pieces of it. This is not scalable, though -- if your index size is non-trivial, I don't think you want to go this route. 2. Use SpanQuery's. See the nice article by Mark Miller on the subject over at Lucid Imagination: <http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/>. If you continue to use StandardAnalyzer, you may want to customize the stopword set it uses, if you have not already done so, so that you don't lose terms you want to match against. Since in your example you're trying to match against punctuation ('#'), you may want to switch to a less-restrictive Analyzer, e.g. WhitespaceAnalyzer, which just breaks text at whitespace sequences, and doesn't modify the resulting terms any further. Steve > -----Original Message----- > From: ba3 [mailto:sbadhrin...@gmail.com] > Sent: Monday, July 27, 2009 9:29 PM > To: java-user@lucene.apache.org > Subject: RE: Multiline Regex with Lucene > > > Hi Steve, > > I had used the standardanalyzer. Should a different one be used ? > > -- > Ba3 > > > Steven A Rowe wrote: > > > > Hi ba3, > > > > What analyzer did you use when indexing the content field? > > > > Steve > > > >> -----Original Message----- > >> From: ba3 > >> Sent: Sunday, July 26, 2009 9:53 AM > >> To: java-user@lucene.apache.org > >> Subject: Multiline Regex with Lucene > >> > >> > >> I was trying to do a regex search with the lucene and > >> JavaUtilRegexCapabilities. > >> The code used is : > >> RegexQuery query = new RegexQuery(new > >> Term("contents","(?m)hello.*(\r[^#]*)This is to be > >> searched.*(\r[^#]*)#")); > >> query.setRegexImplementation(new JavaUtilRegexCapabilities()); > >> > >> I verified the regex in : http://www.gskinner.com/RegExr/ [with the > >> multi line checked] > >> In lucene though there are no hits. Can you please point me in the > >> right direction > >> > >> -- Rgds > >> Ba3 > >> > >> Regex : > >> hello.*(\r[^#]*)This is to be searched.*(\r[^#]*)# > >> > >> Content : > >> hello world > >> This is to be searched > >> # > >> Test line should not be selected > >> hello > >> This should not work > >> some other lines > >> # > >> Not to be selected > >> hello world > >> Some lines > >> This is to be searched > >> Some lines > >> # > >> hello earth > >> some lines > >> # > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > -- > View this message in context: http://www.nabble.com/Multiline-Regex-with- > Lucene-tp24667109p24691023.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org