> [...] the most recent patch on the following JIRA issue:
>
> https://issues.apache.org/jira/browse/LUCENE-2167
>
> It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and
> e-mails too, in accordance with the relevant IETF RFCs.
>
> Steve
>
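For reference, here is roughly what tokenizing with a URL/e-mail aware tokenizer looks like once that patch is in place. This is a minimal sketch, assuming a Lucene release that includes UAX29URLEmailTokenizer (the class this work shipped as from Lucene 3.1 onward); the constructor shown is the 3.6-style one taking a Version and a Reader, and the sample text is made up:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class UrlEmailTokenDemo {
        public static void main(String[] args) throws Exception {
            String text = "Mail someone@example.com or see http://example.com/docs?x=1";
            // Unlike the classic StandardTokenizer, this tokenizer emits the
            // full URL and the full e-mail address as single tokens.
            Tokenizer tokenizer =
                new UAX29URLEmailTokenizer(Version.LUCENE_36, new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();
            tokenizer.close();
        }
    }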
> > -Original Message-
> > From: Sudha Verma [mailto:verma.su...@gmail.com]
> > Sent: Wednesday, June 23, 2010 2:07 PM
> > To: jav[...]
Hi,
I am new to Lucene and I am using Lucene 3.0.2.
I am using Lucene to parse text which may contain URLs. I noticed that the
StandardTokenizer keeps email addresses in one token, but not URLs.
I also looked at the Solr wiki pages, and even though the wiki page for
solr.StandardTokenizerFactory s[...]
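For concreteness, a small sketch like the following (Lucene 3.0 TokenStream API; the field name and sample text are just placeholders) prints the tokens StandardAnalyzer produces and makes the difference easy to see:

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class StandardAnalyzerTokenDump {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            String text = "Mail someone@example.com or see http://example.com/docs";
            TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            stream.reset();
            // The e-mail address comes out as a single token; the URL is split
            // into pieces (scheme, host, path) by the classic StandardTokenizer.
            while (stream.incrementToken()) {
                System.out.println(term.term());
            }
            stream.end();
            stream.close();
        }
    }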
> [...] text file (even though it
> has a .html extension):
>
> http://nerxs.com/mirrorpages/urlregex.html
>
> Also, Jeffrey Friedl's book "Mastering Regular Expressions", 2nd edition
> (but not the 1st edition), has a section on recognizing URLs in Chapter 5.
>
> Steve
>
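If you want to experiment with the regex route before anything lands in the tokenizer, here is a toy example using java.util.regex. The pattern is deliberately simplistic, nothing like the carefully constructed expressions at the link above or in Friedl's chapter, and is only meant to show the mechanics:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleUrlMatcher {
        // A deliberately naive pattern: scheme, "://", host, optional port,
        // optional path/query. Real patterns (see the references above) are
        // much more careful about things like trailing punctuation.
        private static final Pattern URL = Pattern.compile(
            "(?:https?|ftp|file)://[\\w.-]+(?::\\d+)?(?:/[\\w./%?&=#~+-]*)?",
            Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) {
            String text = "See http://example.com/docs?x=1 and ftp://ftp.example.org/pub/file.txt.";
            Matcher m = URL.matcher(text);
            while (m.find()) {
                System.out.println(m.group());
            }
        }
    }

Note that the second match swallows the sentence-ending period; handling that kind of edge case is exactly what the longer expressions referenced above spend their effort on.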
Hi,
I am using Lucene 2.9.1.
I am reading in free-text documents which I index using Lucene and the
StandardAnalyzer at the moment.
The StandardAnalyzer keeps email addresses intact and does not split them
into multiple tokens. Is there something similar for
URLs? This seems like a common need, so I thought I'd ask here.