Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

Sirish Vadala Mon, 14 Mar 2011 11:20:20 -0700

I had exactly the same requirement to parse and index offline html files. I
had written my own HTML scanner using
javax.swing.text.html.HTMLEditorKit.Parser. It sounds difficult, but pretty
simple and straight forward to implement, a simple 40 line java class did
the job for me.



shrinath.m wrote:
> 
> On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] <
> ml-node+2664380-1940163870-376...@n3.nabble.com> wrote:
> 
>>   But I think the parser will most be used when crawling. So you can use
>> these parsers when crawling and save parsed result only.
>>
> 
> Consider we've offline HTML pages, no parsing while crawling, now what ?
> Any tokenizer someone has built for this ?
> 
> 
> How does Solr do it ?
> 
> 
> -- 
> Regards
> Shrinath.M
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2676832.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

Reply via email to