Thank you. I looked at that article briefly some time ago. My impression was that I would have to change some lower-level Lucene classes to make it work that way, and I don't have much of an idea whether that would break things.
I am primarily looking for a Lucene solution at this point.

On Thu, Feb 25, 2016 at 10:53 AM Greg Bowyer <gbow...@fastmail.co.uk> wrote:

> Possibly not helpful but some time ago Russ Cox implemented a code
> search at Google.
>
> His design is documented here https://swtch.com/~rsc/regexp/regexp4.html
>
> On Wed, Feb 24, 2016, at 08:01 AM, Kudrettin Güleryüz wrote:
> > I appreciate the pointers Jack. More on that, where can I read more on
> > enabling full regexp support on indexed source code documents using
> > Lucene?
> >
> > Any suggestions regarding cases where developers implemented this kind of
> > capability using Lucene/Solr/ElasticSearch/... would be more than
> > welcome.
> >
> > Thank you,
> > Kudret
> >
> > On Mon, Feb 15, 2016 at 10:22 AM Jack Krupansky
> > <jack.krupan...@gmail.com> wrote:
> >
> > > You can have two parallel fields, one tokenized as a programming language
> > > would (identifiers, operators) and one using the keyword tokenizer for each
> > > line. You have to decide whether to treat each line as a separate Lucene
> > > document or treat each source file as a multivalued field, one value per
> > > source line. And then there is the issue of code sequences that span source
> > > lines.
> > >
> > > -- Jack Krupansky
> > >
> > > On Mon, Feb 15, 2016 at 8:30 AM, Kudrettin Güleryüz <kudret...@gmail.com>
> > > wrote:
> > >
> > > > Since documents are source code, I am considering matching on operators
> > > > too.
> > > >
> > > > Using whitespace analyzer, A=foo(){ would be a single term, A = foo () {
> > > > would be five terms. Different documents can have a different combination
> > > > of the identifiers and operators in the example. A regexp query like
> > > > /A\s*=\s*foo\s*()\s*{/ could match all of them if multi term regexp was
> > > > allowed. Is it not allowed by default, or not possible at all? Any
> > > > suggestions?
> > > >
> > > > On Sat, Feb 13, 2016 at 5:09 PM Jack Krupansky <jack.krupan...@gmail.com>
> > > > wrote:
> > > >
> > > > > Just to be clear, the whitespace tokenizer would treat "A=foo(){" as a
> > > > > single token. I presume you want "A" and "foo" to be separate terms.
> > > > >
> > > > > You still haven't indicated what regex you were considering. Try
> > > > > explaining your query in plain English. I mean, do you want to search
> > > > > for two keywords with any operator sequence between them? Or... do you
> > > > > want to match on operators as well but simply want to ignore whitespace?
> > > > >
> > > > > Generally, the standard analyzer/tokenizer is better/easier - you can
> > > > > simply query "A foo" and it will match all three of your statements.
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Sat, Feb 13, 2016 at 4:29 PM, Kudrettin Güleryüz <kudret...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > As mentioned, the document is source code. As you know, all of the
> > > > > > statements below are equivalent:
> > > > > > A = foo() {
> > > > > > A=foo(){
> > > > > > A= foo(){
> > > > > > ...
> > > > > >
> > > > > > With the standard whitespace analyzer in action, the statements I want
> > > > > > to match can span one to five terms in this case. If spacing were
> > > > > > fixed, I could go with either a phrase search or a regexp. Any
> > > > > > suggestions for this case?
> > > > > >
> > > > > > On Sat, Feb 13, 2016 at 1:34 PM Jack Krupansky <jack.krupan...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Obviously you wouldn't need to do a regex for simple terms like foo
> > > > > > > and bar - just use simple terms and quoted phrase to match "foo bar".
> > > > > > > If you really do need to do complex pattern regexes and match across
> > > > > > > adjacent terms, your best bet is to keep a copy of the source text in
> > > > > > > a separate string (not tokenized text) field and then you can do a
> > > > > > > complex regex that spans terms (and only do that if normal span
> > > > > > > queries don't do what you need.)
> > > > > > >
> > > > > > > What does your typical cross-term regex actually look like?
> > > > > > >
> > > > > > > -- Jack Krupansky
> > > > > > >
> > > > > > > On Sat, Feb 13, 2016 at 1:25 PM, Uwe Schindler <u...@thetaphi.de>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > That's very easy to explain: Regexp queries only work on terms, you
> > > > > > > > already said it in your introduction. There is no phrase query in
> > > > > > > > Lucene that accepts regular expressions.
> > > > > > > >
> > > > > > > > Uwe
> > > > > > > >
> > > > > > > > -----
> > > > > > > > Uwe Schindler
> > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > http://www.thetaphi.de
> > > > > > > > eMail: u...@thetaphi.de
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Kudrettin Güleryüz [mailto:kudret...@gmail.com]
> > > > > > > > > Sent: Saturday, February 13, 2016 7:14 PM
> > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > Subject: Spaces in regular expressions
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am using standard whitespace analyzer to index a source code
> > > > > > > > > document using Lucene 5.
> > > > > > > > >
> > > > > > > > > I understand that a document with content foo bar would have only
> > > > > > > > > two terms: foo and bar. When I search for "foo bar" it normally
> > > > > > > > > matches the document.
> > > > > > > > > Similarly a regexp query /foo/ or /bar/ also matches the
> > > > > > > > > document.
> > > > > > > > >
> > > > > > > > > Can you help me understand why a regexp query like /foo bar/
> > > > > > > > > doesn't match the document?
> > > > > > > > >
> > > > > > > > > Thank you,
> > > > > > > > > Kudret
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
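
The behaviour discussed in this thread can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under stated assumptions: it uses java.util.regex rather than Lucene's own regexp syntax, it assumes (as Uwe describes) that a regexp query is tested against each indexed term and has to match a whole term, and the names TermRegexSketch, whitespaceTerms, keywordTerm and matchesAnyTerm are hypothetical helpers, not Lucene APIs. The literal parentheses and brace are escaped here because java.util.regex would otherwise read them as grouping and repetition syntax.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical sketch, not Lucene code: models "regexp queries only work on terms".
class TermRegexSketch {

    // Java-regex form of the statement pattern from the thread,
    // with ( ) and { escaped so they match literally.
    static final String STATEMENT = "A\\s*=\\s*foo\\s*\\(\\)\\s*\\{";

    // Roughly what a whitespace tokenizer yields: one term per run of non-space characters.
    static String[] whitespaceTerms(String text) {
        return text.trim().split("\\s+");
    }

    // Roughly what a keyword tokenizer yields: the whole value as one single term.
    static String[] keywordTerm(String line) {
        return new String[] { line };
    }

    // Assumption: the pattern must match one entire term; it never spans two terms.
    static boolean matchesAnyTerm(String[] terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return Arrays.stream(terms).anyMatch(term -> pattern.matcher(term).matches());
    }

    public static void main(String[] args) {
        // "foo bar" is indexed as the terms "foo" and "bar".
        System.out.println(matchesAnyTerm(whitespaceTerms("foo bar"), "foo"));     // true
        System.out.println(matchesAnyTerm(whitespaceTerms("foo bar"), "foo bar")); // false: no term contains a space

        // The same statement with different spacing.
        for (String line : new String[] { "A = foo() {", "A=foo(){", "A= foo(){" }) {
            boolean perTerm = matchesAnyTerm(whitespaceTerms(line), STATEMENT);
            boolean wholeLine = matchesAnyTerm(keywordTerm(line), STATEMENT);
            System.out.println(line + " -> per-term: " + perTerm + ", whole line: " + wholeLine);
        }
    }
}
```

With whitespace terms, only the spacing-free variant A=foo(){ matches, because it happens to be a single term. With the whole line kept as one term, which is the idea behind Jack's suggestion of a keyword-tokenized or untokenized string field, all three variants match. A real line with indentation or trailing code would need the pattern widened (for example with a leading and trailing .*) since the whole term has to match.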