Re: HTML text extraction

Chris Hostetter Wed, 21 Jun 2006 00:37:54 -0700

If you just want something to extract the text from HTML, without trying
to extract structure (i.e. you don't care about title vs h1 vs bold vs meta
keywords), then the HTMLStripReader (or
HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be
useful.  It wasn't intended to deal with full HTML documents (hence it
doesn't have any mechanism for inferring structure), but it was intended to
do the best job possible when dealing with dirty data that might be plain
text, or a chunk of HTML, or mostly plain text with a little bit of HTML
sprinkled in.
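
To illustrate the general idea (this is not the Solr code), here is a
minimal sketch in Python using only the standard library's html.parser.
The names TextExtractor and strip_html are hypothetical, and the choices
to skip script/style content and to replace each tag with a space are
assumptions of this sketch, not a description of how HTMLStripReader
behaves.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps character data, drops all markup."""

    # Assumption of this sketch: text inside these elements is not wanted.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        # Every tag becomes a space so adjacent words do not run together.
        self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def strip_html(raw):
    """Return the text of `raw`, which may be plain text, HTML, or a mix."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered trailing text
    return parser.text()
```

Because the parser tolerates input with no markup at all, the same call
works for plain text, for an HTML fragment, and for plain text with a few
tags mixed in. No structure (title, headings, and so on) is preserved.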




-Hoss


