Re: Spaces in regular expressions

Kudrettin Güleryüz Mon, 15 Feb 2016 05:31:21 -0800

Since documents are source code, I am considering matching on operators too.


Using the whitespace analyzer, A=foo(){ would be a single term, while
A = foo () { would be five terms. Different documents can have different
combinations of the identifiers and operators in the example. A regexp query
like /A\s*=\s*foo\s*\(\)\s*\{/ could match all of them if multi-term regexps
were allowed. Are they just not allowed by default, or not possible at all?
Any suggestions?
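
To make the difference concrete, here is a minimal sketch that uses plain
java.util.regex rather than Lucene itself. The class and method names are
made up for illustration, and Lucene's regexp query has its own syntax that
differs from java.util.regex, so treat this only as a model of "regexp per
term" versus "regexp over the untokenized text":

```java
import java.util.regex.Pattern;

public class TermRegexSketch {
    // The pattern from the message, with the parentheses and brace escaped.
    // This is java.util.regex syntax, not Lucene's regexp syntax.
    static final Pattern STATEMENT =
            Pattern.compile("A\\s*=\\s*foo\\s*\\(\\)\\s*\\{");

    // Models a term-level regexp: true only if one single
    // whitespace-delimited token matches the whole pattern.
    static boolean anyTermMatches(String text) {
        for (String term : text.trim().split("\\s+")) {
            if (STATEMENT.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    // Models a regexp run against an untokenized copy of the text.
    static boolean wholeTextMatches(String text) {
        return STATEMENT.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] variants = { "A=foo(){", "A= foo(){", "A = foo () {" };
        for (String v : variants) {
            System.out.println(v + " | per term: " + anyTermMatches(v)
                    + " | whole text: " + wholeTextMatches(v));
        }
    }
}
```

In this model only the compact form A=foo(){ matches per term, because it is
the only variant that ends up as one token; all three variants match when the
pattern is applied to the whole text.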

On Sat, Feb 13, 2016 at 5:09 PM Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> Just to be clear, the whitespace tokenizer would treat "A=foo(){" as a
> single token. I presume you want "A" and "foo" to be separate terms.
>
> You still haven't indicated what regex you were considering. Try explaining
> your query in plain English. I mean, do you want to search for two keywords
> with any operator sequence between them? Or... do you want to match on
> operators as well but simply want to ignore whitespace?
>
> Generally, the standard analyzer/tokenizer is better/easier - you can
> simply query "A foo" and it will match all three of your statements.
>
> -- Jack Krupansky
>
> On Sat, Feb 13, 2016 at 4:29 PM, Kudrettin Güleryüz <kudret...@gmail.com>
> wrote:
>
> > As mentioned, the document is source code. As you know, all of the
> > statements below are equivalent:
> > A = foo() {
> > A=foo(){
> > A= foo(){
> > ...
> >
> > With the standard whitespace analyzer in action, the statement I want to
> > match can span one to five terms in this case. If the spacing were fixed,
> > I could use either a phrase search or a regexp. Any suggestions for this
> > case?
> >
> >
> >
> > On Sat, Feb 13, 2016 at 1:34 PM Jack Krupansky <jack.krupan...@gmail.com>
> > wrote:
> >
> > > Obviously you wouldn't need a regex for simple terms like foo and bar
> > > - just use simple terms and a quoted phrase to match "foo bar". If you
> > > really do need complex pattern regexes that match across adjacent
> > > terms, your best bet is to keep a copy of the source text in a separate
> > > string (not tokenized text) field, and then you can do a complex regex
> > > that spans terms (and only do that if normal span queries don't do what
> > > you need).
> > >
> > > What does your typical cross-term regex actually look like?
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Sat, Feb 13, 2016 at 1:25 PM, Uwe Schindler <u...@thetaphi.de>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > That's very easy to explain: Regexp queries only work on terms, you
> > > > already said it in your introduction. There is no phrase query in
> > > > Lucene that accepts regular expressions.
> > > >
> > > > Uwe
> > > >
> > > > -----
> > > > Uwe Schindler
> > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > http://www.thetaphi.de
> > > > eMail: u...@thetaphi.de
> > > >
> > > > > -----Original Message-----
> > > > > From: Kudrettin Güleryüz [mailto:kudret...@gmail.com]
> > > > > Sent: Saturday, February 13, 2016 7:14 PM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Spaces in regular expressions
> > > > >
> > > > > Hello,
> > > > >
> > > > > I am using standard whitespace analyzer to index a source code
> > > > > document using Lucene 5.
> > > > >
> > > > > I understand that a document with content foo bar would have only
> > > > > two terms: foo and bar. When I search for "foo bar" it normally
> > > > > matches the document. Similarly a regexp query /foo/ or /bar/ also
> > > > > matches the document.
> > > > >
> > > > > Can you help me understand why a regexp query like /foo bar/
> > > > > doesn't match the document?
> > > > >
> > > > > Thank you,
> > > > > Kudret
>
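
As a rough illustration of the alternative discussed in the quoted messages
(making identifiers and operators separate terms so that spacing no longer
matters), the following sketch tokenizes source text with plain
java.util.regex. The class name and the token pattern are assumptions chosen
for this example, not an existing Lucene analyzer; in Lucene the same idea
would need a custom or pattern-based tokenizer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeTokenSketch {
    // A token is either a run of word characters (identifier or number)
    // or one single character that is neither a word character nor whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Splits source text into identifier and operator tokens,
    // discarding all whitespace.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("A=foo(){"));
        System.out.println(tokenize("A= foo(){"));
        System.out.println(tokenize("A = foo () {"));
    }
}
```

With this tokenization all three spellings of the statement produce the same
six tokens (A, =, foo, (, ), {), so an ordinary phrase-style match over the
tokens would cover them without any cross-term regexp.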
