Re: Text extraction from HTML

Jack Tang Fri, 29 Jul 2005 04:37:49 -0700

Hi Novelli

Do you have to use the HtmlParser that ships with Nutch?
If alternatives are acceptable, you could try the htmlparser project
hosted on sf.net:


http://htmlparser.sourceforge.net/
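
For reference, the general idea behind text extraction (walk the markup,
keep the character data, drop script and style content) can be sketched
without any framework configuration. The snippet below is only an
illustration under stated assumptions: it uses Python's standard-library
html.parser module, not htmlparser from sf.net and not the Nutch plugin,
and the names TextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible character data, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse all runs of whitespace.
        # Note: this also puts a space where inline tags split a word.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the plain text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><p>Hello <b>world</b></p></body></html>"
    print(extract_text(sample))
```

Pages found in the wild are often malformed, so a sketch like this is
only a starting point; a dedicated parser will usually cope better with
broken markup and character encodings.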

Regards
/Jack

On 7/29/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote:
> Hello,
> I'm working on the development of multi-agent software that
> involves some information indexing, information retrieval and
> information categorization tasks. I want to build the training set for
> categorization using a set of HTML pages fetched from DMOZ RDF dumps.
> I have tried the HtmlParser that comes with Nutch, but I wasn't able to
> make it work without adjusting Nutch's global XML configuration;
> is that perhaps the only way to make such a plugin work? Does Lucene expose
> any good HTML parser in the contrib section to parse web pages found
> in the wild?
> 
> Best regards,
> Giovanni Novelli
> 
> P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
