RE: Multiline Regex with Lucene

Steven A Rowe Tue, 28 Jul 2009 09:25:32 -0700

Hi ba3,

StandardAnalyzer breaks text into individual terms, removes most punctuation, 
downcases, removes stopwords, etc.  Your example text becomes the following 
sequence of terms:

        1. hello
        2. world
        3. searched   (this,is,to,be are all in the default stopword set)
        4. test       ('#', like most other punctuation, is removed)
        ...

The regular expression you gave is matched against the text of each of the 
individual tokens produced by the Analyzer (first "hello", then "world", etc.), 
so you'll never get a match.

There are two routes I can think of that you could take:

1. Use KeywordAnalyzer instead of StandardAnalyzer when you index the contents 
field.  That would place the entire text in a single field value, and would 
allow your regex to match the full text, rather than pieces of it.  This is not 
scalable, though -- if your index size is non-trivial, I don't think you want 
to go this route.

2. Use SpanQuery's.  See the nice article by Mark Miller on the subject over at 
Lucid Imagination: 
<http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/>.  

If you continue to use StandardAnalyzer, you may want to customize the stopword 
set it uses, if you have not already done so, so that you don't lose terms you 
want to match against.  

Since in your example you're trying to match against punctuation ('#'), you may 
want to switch to a less-restrictive Analyzer, e.g. WhitespaceAnalyzer, which 
just breaks text at whitespace sequences, and doesn't modify the resulting 
terms any further.

Steve

> -----Original Message-----
> From: ba3 [mailto:sbadhrin...@gmail.com]
> Sent: Monday, July 27, 2009 9:29 PM
> To: java-user@lucene.apache.org
> Subject: RE: Multiline Regex with Lucene
> 
> 
> Hi Steve,
> 
> I had used the standardanalyzer. Should a different one be used ?
> 
> --
> Ba3
> 
> 
> Steven A Rowe wrote:
> >
> > Hi ba3,
> >
> > What analyzer did you use when indexing the content field?
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: ba3
> >> Sent: Sunday, July 26, 2009 9:53 AM
> >> To: java-user@lucene.apache.org
> >> Subject: Multiline Regex with Lucene
> >>
> >>
> >> I was trying to do a regex search with the lucene and
> >> JavaUtilRegexCapabilities.
> >> The code used is :
> >> RegexQuery query = new RegexQuery(new
> >> Term("contents","(?m)hello.*(\r[^#]*)This is to be
> >> searched.*(\r[^#]*)#"));
> >> query.setRegexImplementation(new JavaUtilRegexCapabilities());
> >>
> >> I verified the regex in : http://www.gskinner.com/RegExr/  [with the
> >> multi line checked]
> >> In lucene though there are no hits. Can you please point me in the
> >> right direction
> >>
> >> -- Rgds
> >> Ba3
> >>
> >> Regex :
> >> hello.*(\r[^#]*)This is to be searched.*(\r[^#]*)#
> >>
> >> Content :
> >> hello world
> >> This is to be searched
> >> #
> >> Test line should not be selected
> >> hello
> >> This should not work
> >> some other lines
> >> #
> >> Not to be selected
> >> hello world
> >> Some lines
> >> This is to be searched
> >> Some lines
> >> #
> >> hello earth
> >> some lines
> >> #
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
> >
> 
> --
> View this message in context: http://www.nabble.com/Multiline-Regex-with-
> Lucene-tp24667109p24691023.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

RE: Multiline Regex with Lucene

Reply via email to