On 2/8/07, Peter W. <[EMAIL PROTECTED]> wrote:
Using a parser to get text out of HTML, XML (including RSS, ATOM) is only
easy if you control the source documents.
HTML pages in the wild are much different, generating exceptions you must
catch and deal with.
Yes, that's why the Solr version isn
Hello,
Using a parser to get text out of HTML, XML (including RSS, ATOM) is only
easy if you control the source documents.
HTML pages in the wild are much different, generating exceptions you must
catch and deal with. For most projects you can probably use java.util.regex
to obtain keywo
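
As a rough illustration of the java.util.regex suggestion above, here is a minimal sketch that drops anything that looks like a tag and collects the remaining words. The class name RegexKeywordSketch, the method keywords, and both patterns are made up for this example and are not from the original message. It is deliberately naive: it does not decode entities such as &amp; and it does not skip script or style content, so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
class RegexKeywordSketch {
    // Anything between '<' and the next '>' is treated as a tag (naive on purpose).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // A "keyword" here is simply a run of ASCII letters or digits (an assumption).
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    static List<String> keywords(String html) {
        // Replace tags with a space so adjacent words do not run together.
        String text = TAG.matcher(html).replaceAll(" ");
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    public static void main(String[] args) {
        // Unclosed tags do not throw; they are just removed like any other tag.
        System.out.println(keywords("<p>Hello <b>Solr</b> world"));
    }
}
```

Because no parsing is involved, badly nested or unclosed tags cannot raise an exception here, which is the appeal of the regex route for pages in the wild.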
Solr has an HTMLStripReader used by two different tokenizers for doing
the basics of ignoring tags when reading text ... it has one known bug
when dealing with highlighting...
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HTMLStripReader.html
http://lucene.apache.org/solr/api/org/
Sure, just don't index the HTML tags in the first place. Of course that
means you need to parse the document first. Here's a parser that was
mentioned on the thread a while ago:
http://sourceforge.net/projects/mozillaparser
There may very well be others.
Depending on how sophisticated you
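
To make the "parse first, then index only the text" idea concrete, here is a minimal sketch. It does not use the mozillaparser project linked above; it swaps in the lenient HTML parser that ships with the JDK (javax.swing.text.html) so the example is self-contained. The class name HtmlTextExtractorSketch and the method extractText are invented for illustration, and joining text chunks with a single space is an assumption about what you would want to feed to the indexer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene, Solr, or mozillaparser.
class HtmlTextExtractorSketch {
    static String extractText(String html) {
        final StringBuilder out = new StringBuilder();
        // The callback only overrides handleText, so tags never reach the output.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>Solr</b> world</p></body></html>"));
    }
}
```

The returned string is what you would hand to the analyzer in place of the raw page. A dedicated parser may cope better with truly broken markup, so this is only one option among the "others" mentioned above.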