Hi, Some time ago, I had to modify and extend the Lucene StandardTokenizer grammar (flex file) so that it preserves URIs (based on RFC3986). I have extracted the files from my project and published the source code on github [1] under the Apache License 2.0, if it can help.
[1] http://github.com/rdelbru/lucene-uri-preserving-standard-tokenizer -- Renaud Delbru -----Original Message----- From: Sudha Verma [mailto:verma.su...@gmail.com] Sent: Thu 11/19/2009 9:35 PM To: java-user@lucene.apache.org Subject: Re: Keep URLs intact and not tokenized by the StandardTokenizer Thanks. I was hoping Lucene would already have a solution for this since it seems like it would be a common problem. I am new to the lucene API. If I were to implement something from scratch, are my options to extend the Tokenizer to support http regex and then pass the text to StandardTokenizer... -sudha On Thu, Nov 19, 2009 at 12:15 PM, Steven A Rowe <sar...@syr.edu> wrote: > Hi Sudha, > > In the past, I've built regexes to recognize URLs using the information > here: > > http://www.foad.org/~abigail/Perl/url2.html > > The above, however, is currently a dead link. > > Here's the Internet Archive's WayBack Machine's cache of this page from > August 2007: > > < > http://web.archive.org/web/20070807114147/http://www.foad.org/~abigail/Perl/url2.html > > > > Here's the same content, of unknown vintage, as a text file (even though it > has a .html extension): > > http://nerxs.com/mirrorpages/urlregex.html > > Also, Jeffrey Friedl's book "Mastering Regular Expressions", 2nd edition > (but not the 1st edition), has a section on recognizing URLs in Chapter 5. > > Steve > > On 11/19/2009 at 12:58 AM, Sudha Verma wrote: > > Hi, > > > > I am using lucene 2-9-1. > > > > I am reading in free text documents which I index using lucene and the > > StandardAnalyzer at the moment. > > > > The StandardAnalyzer keeps email addresses intact and does not tokenize > > them. Is there something similar for > > URLs? This seems like a common need. So, I thought I'd check if there > > is anything out there that does it already. > > > > I'd appreciate any help. > > > > Thanks, > > sudha > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org