RE: Spaces in regular expressions

Uwe Schindler Fri, 26 Feb 2016 00:53:34 -0800

Hi,

In general, you can implement everything described in this paper (the Russ Cox 
article linked below) using Lucene. You don't need to customize Lucene for 
this; just use its official APIs and tokenizers.


You have to build your own Analyzer that builds trigrams and does *not* 
tokenize on whitespace and so on. To me it looks like you have to tokenize 
the source code on newlines (one token per line of code) and then make 
trigrams out of it.

Here are the components for an Analyzer (Lucene 6 syntax with Java 8; it can 
be rewritten for 5.5, it is just easier to show this way; Lucene 5 needs a 
subclass of CharTokenizer):

Tokenizer: CharTokenizer.fromSeparatorCharPredicate(ch -> ch == '\n')  // expand for other newline characters, too
TokenFilters: maybe "new LowerCaseFilter(tokenizer)", and finally "new 
NGramTokenFilter(lowerCaseFilter, 3, 3)" (trigrams on the lowercased tokens)

After that you can index using this analyzer and have trigrams in your index.
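
As a rough illustration of which terms such an analyzer would put into the 
index, here is a plain-Java sketch. It does not use any Lucene classes; the 
class and method names (TrigramSketch, trigrams) are made up for this example, 
and it only mimics the three steps above: split on '\n', lowercase, emit every 
3-character window of each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: shows the terms a
// "newline tokenizer + lowercase + trigram" chain would roughly produce.
final class TrigramSketch {

    static List<String> trigrams(String source) {
        List<String> terms = new ArrayList<>();
        // One token per line; only '\n' is handled here (expand for "\r\n" etc.).
        for (String line : source.split("\n")) {
            String token = line.toLowerCase(Locale.ROOT);
            // Slide a 3-character window over the token. Lines shorter
            // than 3 characters yield no trigram at all.
            for (int i = 0; i + 3 <= token.length(); i++) {
                terms.add(token.substring(i, i + 3));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("A=foo(){"));
    }
}
```

For the line "A=foo(){" this gives the six terms a=f, =fo, foo, oo(, o() and 
(){. Whitespace and operators are part of the trigrams, and no trigram spans 
two lines, because each line is its own token.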

You can then use the algorithms as described in the paper to build TermQuery 
instances and combine them with BooleanQuery from the regular expression to 
select the candidate documents. In a final step you can loop through candidate 
results and filter them by applying the "real regex". This would be done 
outside of Lucene code.
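
The two-step idea (select candidates by trigram, then apply the real regex) 
can be sketched in plain Java as follows. This is only an illustration under 
assumptions: there is no Lucene index here, the in-memory trigram check stands 
in for the BooleanQuery of TermQuery clauses, and the literal "foo" is picked 
by hand instead of being derived from the regex as the paper describes. All 
names (CandidateFilterSketch, trigramsOf, search) are made up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: trigram pre-filter followed by the "real regex".
final class CandidateFilterSketch {

    // Lowercased trigrams of a text, computed line by line.
    static Set<String> trigramsOf(String text) {
        Set<String> terms = new HashSet<>();
        for (String line : text.split("\n")) {
            String token = line.toLowerCase(Locale.ROOT);
            for (int i = 0; i + 3 <= token.length(); i++) {
                terms.add(token.substring(i, i + 3));
            }
        }
        return terms;
    }

    static List<String> search(List<String> docs, List<String> requiredLiterals, Pattern regex) {
        // Step 1: every trigram of every required literal must be present
        // (the equivalent of AND-ing one term query per trigram).
        Set<String> required = new HashSet<>();
        for (String literal : requiredLiterals) {
            required.addAll(trigramsOf(literal));
        }
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            boolean candidate = trigramsOf(doc).containsAll(required);
            // Step 2: run the real regex only on candidates, against the
            // original (not lowercased) text.
            if (candidate && regex.matcher(doc).find()) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With the pattern A\s*=\s*foo\s*\(\)\s*\{ and the required literal "foo", the 
documents "A = foo() {" and "A=foo(){" both match. A document like "x = foo;" 
is a candidate (it contains the trigram foo) but is dropped by the regex, and 
"B = bar() {" is never handed to the regex at all.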

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Kudrettin Güleryüz [mailto:kudret...@gmail.com]
> Sent: Thursday, February 25, 2016 7:58 PM
> To: java-user@lucene.apache.org
> Subject: Re: Spaces in regular expressions
> 
> Thank you, I had looked at that article a little, some time ago. I was
> thinking I may have to change some lower level Lucene classes to be able to
> work like that. Plus I don't have much clue if that would break things.
> 
> I am primarily looking for a Lucene solution at this point.
> 
> On Thu, Feb 25, 2016 at 10:53 AM Greg Bowyer <gbow...@fastmail.co.uk>
> wrote:
> 
> > Possibly not helpful but some time ago Russ Cox implemented a code
> > search at Google.
> >
> > His design is documented here
> https://swtch.com/~rsc/regexp/regexp4.html
> >
> > On Wed, Feb 24, 2016, at 08:01 AM, Kudrettin Güleryüz wrote:
> > > I appreciate the pointers Jack. More on that, where can I read more on
> > > enabling full regexp support on indexed source code documents using
> > > Lucene?
> > >
> > > Any suggestions regarding cases where developers implemented this
> kind of
> > > capability using Lucene/Solr/ElasticSearch/... would be more than
> > > welcome.
> > >
> > > Thank you,
> > > Kudret
> > >
> > > On Mon, Feb 15, 2016 at 10:22 AM Jack Krupansky
> > > <jack.krupan...@gmail.com>
> > > wrote:
> > >
> > > > You can have two parallel fields, one tokenized as a programming
> > language
> > > > would (identifiers, operators) and one using the keyword tokenizer for
> > each
> > > > line. You have to decide whether to treat each line as a separate
> > Lucene
> > > > document or treat each source file as a multivalued field, one value
> > per
> > > > source line. And then there is the issue of code sequences that span
> > source
> > > > lines.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Mon, Feb 15, 2016 at 8:30 AM, Kudrettin Güleryüz <
> > kudret...@gmail.com>
> > > > wrote:
> > > >
> > > > > Since documents are source code, I am considering matching on
> > operators
> > > > > too.
> > > > >
> > > > > Using whitespace analyzer, A=foo(){ would be a single term, A = foo
> > () {
> > > > > would be  five terms. Different documents can have a different
> > > > combination
> > > > > of the identifiers and operators in the example. A regexp query like
> > > > > /A\s*=\s*foo\s*()\s*{/ could match all of them if multi term regexp
> > was
> > > > > allowed. Is it not allowed by default, or not possible at all? Any
> > > > > suggestions?
> > > > >
> > > > > On Sat, Feb 13, 2016 at 5:09 PM Jack Krupansky <
> > jack.krupan...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Just to be clear, the whitespace tokenizer would treat "A=foo(){"
> > as a
> > > > > > single token. I presume you want "A" and "foo" to be separate
> > terms.
> > > > > >
> > > > > > You still haven't indicated what regex you were considering. Try
> > > > > explaining
> > > > > > your query in plain English. I mean, do you want to search for two
> > > > > keywords
> > > > > > with any operator sequence between them? Or... do you want to
> > match on
> > > > > > operators as well but simply want to ignore whitespace?
> > > > > >
> > > > > > > Generally, the standard analyzer/tokenizer is better/easier - you can
> > > > > > > simply query "A foo" and it will match all three of your statements.
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Sat, Feb 13, 2016 at 4:29 PM, Kudrettin Güleryüz <
> > > > kudret...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > > As mentioned, the document is source code. As you know, all of the
> > > > > > > > statements below are equivalent:
> > > > > > > A = foo() {
> > > > > > > A=foo(){
> > > > > > > A= foo(){
> > > > > > > ...
> > > > > > >
> > > > > > > > With the standard whitespace analyzer in action, the statements I
> > > > > > > > want to match can span one to five terms in this case. If the
> > > > > > > > spacing were fixed, I could use either a phrase search or a regexp.
> > > > > > > > Any suggestions for this case?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Feb 13, 2016 at 1:34 PM Jack Krupansky <
> > > > > jack.krupan...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > > Obviously you wouldn't need to do a regex for simple terms like foo
> > > > > > > > > and bar - just use simple terms and a quoted phrase to match
> > > > > > > > > "foo bar". If you really
> > If
> > > > you
> > > > > > > really
> > > > > > > > do need to do complex pattern regexes and match across
> adjacent
> > > > > terms,
> > > > > > > your
> > > > > > > > best bet is to keep a copy of the source text in a separate
> > string
> > > > > (not
> > > > > > > > tokenized text) field and then you can do a complex regex that
> > > > spans
> > > > > > > terms
> > > > > > > > (and only do that if normal span queries don't do what you
> > need.)
> > > > > > > >
> > > > > > > > What does your typical cross-term regex actually look like?
> > > > > > > >
> > > > > > > >
> > > > > > > > -- Jack Krupansky
> > > > > > > >
> > > > > > > > On Sat, Feb 13, 2016 at 1:25 PM, Uwe Schindler <
> > u...@thetaphi.de>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > That's very easy to explain: Regexp queries only work on
> > terms,
> > > > you
> > > > > > > > > already said it in your introduction. There is no phrase
> > query in
> > > > > > > Lucene
> > > > > > > > > that accepts regular expressions.
> > > > > > > > >
> > > > > > > > > Uwe
> > > > > > > > >
> > > > > > > > > -----
> > > > > > > > > Uwe Schindler
> > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > http://www.thetaphi.de
> > > > > > > > > eMail: u...@thetaphi.de
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Kudrettin Güleryüz [mailto:kudret...@gmail.com]
> > > > > > > > > > Sent: Saturday, February 13, 2016 7:14 PM
> > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > Subject: Spaces in regular expressions
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > I am using standard whitespace analyzer to index a source
> > code
> > > > > > > document
> > > > > > > > > > using Lucene 5.
> > > > > > > > > >
> > > > > > > > > > I understand that a document with content foo bar would
> > have
> > > > only
> > > > > > two
> > > > > > > > > > terms: foo and bar.  When I search for "foo bar" it
> > normally
> > > > > > matches
> > > > > > > > the
> > > > > > > > > > document. Similarly a regexp query /foo/ or /bar/ also
> > matches
> > > > > the
> > > > > > > > > > document.
> > > > > > > > > >
> > > > > > > > > > > Can you help me understand why a regexp query like
> > > > > > > > > > > /foo bar/ doesn't match the document?
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > > Kudret
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail:
> > java-user-unsubscr...@lucene.apache.org
> > > > > > > > > For additional commands, e-mail:
> > > > java-user-h...@lucene.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >


