Hi Novelli,

Do you need to use the HtmlParser in Nutch specifically? If alternatives are acceptable, you could try the htmlparser project hosted on sf.net:
http://htmlparser.sourceforge.net/

Regards
/Jack

On 7/29/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote:
> Hello,
> I'm working on the development of multi-agent software that
> involves some information indexing, information retrieval and
> information categorization tasks. I want to build the training set for
> categorization using a set of HTML pages fetched from DMOZ RDF dumps.
> I have tried the HtmlParser that comes with Nutch, but I wasn't able to
> make it work without adjusting Nutch's global XML configuration;
> is that perhaps the only way to make the plugin work? Does Lucene offer
> any good HTML parser in the contrib section to parse web pages found
> in the wild?
>
> Best regards,
> Giovanni Novelli
>
> P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars