Re: URL Tokenization

2010-06-25 Thread Sudha Verma
the most recent patch on the following JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2167 It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and e-mails too, in accordance with the relevant IETF RFCs. Steve
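The LUCENE-2167 patch discussed above eventually shipped in Lucene as UAX29URLEmailTokenizer. For readers who just want the idea, here is a minimal, stdlib-only sketch of the technique: recognize URLs and e-mail addresses first so they survive as single tokens, then split the remaining text on non-word characters. The class name and the much-simplified patterns are illustrative assumptions, not Lucene's actual implementation, which follows the IETF RFCs far more closely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of URL/e-mail-aware tokenization (not Lucene code).
public class UrlAwareTokenizer {
    // Simplified patterns; the real tokenizer is grammar-based and RFC-driven.
    private static final Pattern URL_OR_EMAIL = Pattern.compile(
        "(?:https?|ftp|file)://\\S+"                            // URL schemes
        + "|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");  // e-mail addresses

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = URL_OR_EMAIL.matcher(text);
        int last = 0;
        while (m.find()) {
            // Ordinary text before the match is split normally.
            splitPlain(text.substring(last, m.start()), tokens);
            // The whole URL or e-mail address becomes a single token.
            tokens.add(m.group());
            last = m.end();
        }
        splitPlain(text.substring(last), tokens);
        return tokens;
    }

    private static void splitPlain(String text, List<String> out) {
        for (String t : text.split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
    }
}
```

With this, `tokenize("see https://example.com/a?b=1 now")` yields the URL as one token alongside the plain words, which is the behavior StandardTokenizer of that era only provided for e-mail addresses.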

Re: URL Tokenization

2010-06-24 Thread Sudha Verma
(S), FTP, and FILE) URLs together as single tokens, and e-mails too, in accordance with the relevant IETF RFCs. Steve -Original Message- From: Sudha Verma [mailto:verma.su...@gmail.com] Sent: Wednesday, June 23, 2010 2:07 PM To: jav

URL Tokenization

2010-06-23 Thread Sudha Verma
Hi, I am new to Lucene and I am using Lucene 3.0.2. I am using Lucene to parse text which may contain URLs. I noticed the StandardTokenizer keeps email addresses in one token, but not URLs. I also looked at the Solr wiki pages, and even though the wiki page for solr.StandardTokenizerFactory s

Re: Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-19 Thread Sudha Verma
t file (even though it has a .html extension): http://nerxs.com/mirrorpages/urlregex.html Also, Jeffrey Friedl's book "Mastering Regular Expressions", 2nd edition (but not the 1st edition), has a section on recognizing URLs in Chapter 5. Steve
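The page and book chapter referenced above discuss regexes for recognizing URLs in free text. A much-simplified pattern in the same spirit is sketched below; the class name and pattern are illustrative assumptions (not Friedl's actual regex), covering scheme-prefixed URLs plus bare "www." hostnames and avoiding trailing sentence punctuation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified URL recognizer (illustrative only).
public class UrlFinder {
    // Scheme is optional so "www.example.com" is found too; the final
    // character class omits . , ; ? ! so trailing punctuation is not swallowed.
    private static final Pattern URL = Pattern.compile(
        "\\b(?:(?:https?|ftp)://|www\\.)"
        + "[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]");

    public static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

For example, `findUrls("see http://nerxs.com/mirrorpages/urlregex.html.")` returns the URL without the sentence-ending period. A pattern like this is deliberately lossy; the book chapter explains the many trade-offs a production-quality URL regex has to make.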

Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-18 Thread Sudha Verma
Hi, I am using Lucene 2.9.1. I am reading in free-text documents which I index using Lucene and the StandardAnalyzer at the moment. The StandardAnalyzer keeps email addresses intact and does not tokenize them. Is there something similar for URLs? This seems like a common need. So, I thought I'd