if you just want something to extract the text from HTML, without trying to extract structure (ie: you don't care about title vs h1 vs bold vs meta keywords) then the HTMLStripReader (or HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be usefull. It wasn't intended to deal with full HTML documents (hence it doesn't have any mechanism for infering strucutre) but it was intended to do the best job possible when deling with dirty data that might be plain text, or it might be a chunk of HTML, or it might be mostly plain text with a little bit of html sprinkled in.
-Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]