Re: Text extraction from HTML

2005-07-29 Thread Jack Tang
Hi Novelli, do you insist on using the HtmlParser in Nutch? If alternatives are acceptable, you could try the htmlparser project hosted on sf.net: http://htmlparser.sourceforge.net/ Regards /Jack On 7/29/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote: > Hello, > I'm working on the development of a multi-agen

Re: Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
I have tried both HtmlParser v1.5 and NekoHTML. With the former, my implementation doesn't work: for example, it extracts text from JavaScript. I followed the hint from http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/TextExtractingVisitor.html The following is my NOT working implement
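
For illustration only, here is a minimal sketch of the behaviour being asked for: collect the visible text of a page while ignoring whatever sits inside script and style elements. It is an assumption-laden stand-in, not the HtmlParser or NekoHTML API: it uses Python's standard-library html.parser module, and the names VisibleTextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>.

    Hypothetical helper for illustration; not part of HtmlParser or NekoHTML.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string, fragments joined by spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The same idea (track whether the parser is currently inside a script or style element and drop text nodes while it is) should carry over to a visitor or SAX-style handler in a Java parser, though the exact hooks depend on the library.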

Re: Text extraction from HTML

2005-07-29 Thread Patrick Kimber
Hi Giovanni, we are using the Neko HTML parser. Some simple example code can be found in the "Lucene in Action" book. For more information: http://www.manning.com/books/hatcher2 http://www.apache.org/~andyc/neko/doc/html/ Patrick On 29/07/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote: > Hello,

Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
Hello, I'm working on the development of multi-agent software that involves information indexing, information retrieval, and information categorization tasks. I want to build the training set for categorization from a set of HTML pages fetched from DMOZ RDF dumps. I have tried the HtmlParse