Re: How to not tokenize HTML tag from input string

2007-02-08 Thread Yonik Seeley
On 2/8/07, Peter W. <[EMAIL PROTECTED]> wrote: Using a parser to get text out of HTML, XML (including RSS, ATOM) is only easy if you control the source documents. HTML pages in the wild are much different, generating exceptions you must catch and deal with. Yes, that's why the Solr version isn

Re: How to not tokenize HTML tag from input string

2007-02-08 Thread Peter W.
Hello, Using a parser to get text out of HTML, XML (including RSS, ATOM) is only easy if you control the source documents. HTML pages in the wild are much different, generating exceptions you must catch and deal with. For most projects you can probably use java.util.regex to obtain keywo
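
As a rough illustration of the java.util.regex approach mentioned above, here is a minimal sketch that strips markup from a string before it reaches an analyzer. The class name `RegexTagStripper` and the method `stripTags` are hypothetical and do not come from the thread. The patterns are deliberately simple: they will not cope with every malformed page, and they do not decode character entities such as `&amp;`.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr: removes markup with
// plain regular expressions so only text is handed to the indexer.
public class RegexTagStripper {
    // Comments can contain '>' characters, so remove them before tags.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Script and style bodies are not searchable text; drop them whole.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Anything else between '<' and the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        String text = COMMENT.matcher(html).replaceAll(" ");
        text = SCRIPT_OR_STYLE.matcher(text).replaceAll(" ");
        // Replace each tag with a space so adjacent words do not merge.
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>wild</b> web</p>"));
    }
}
```

Because no parser is involved, nothing here throws on broken markup; the trade-off is that an unclosed `<` can swallow text up to the next `>`.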

Re: How to not tokenize HTML tag from input string

2007-02-08 Thread Chris Hostetter
Solr has an HTMLStripReader used by two different tokenizers for doing the basics of ignoring tags when reading text ... it has one known bug when dealing with highlighting... http://lucene.apache.org/solr/api/org/apache/solr/analysis/HTMLStripReader.html http://lucene.apache.org/solr/api/org/

Re: How to not tokenize HTML tag from input string

2007-02-07 Thread Erick Erickson
Sure, just don't index the HTML tags in the first place. Of course, that means you need to parse the document first. Here's a parser that was mentioned on the thread a while ago: http://sourceforge.net/projects/mozillaparser There may very well be others. Depending on how sophisticated you
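
To make the "parse first, then index only the text" idea concrete, here is a minimal sketch. It does not use the mozillaparser project linked above; as a stand-in it uses the lenient HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`). The class name `ParseThenIndexSketch` and the method `extractText` are hypothetical, and the returned string is simply what you would hand to your analyzer in place of the raw HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects only the text nodes of an HTML document
// using the JDK's built-in parser, so tags never reach the tokenizer.
public class ParseThenIndexSketch {
    public static String extractText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Append a space so neighbouring text chunks stay separate.
                out.append(data).append(' ');
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declaration,
            // since the input is already a decoded Java string.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The text returned by `extractText` can then be added to a field as usual. Whether the JDK parser is tolerant enough for pages in the wild is something to check against your own documents.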