subject:"Re\: How to not tokenize HTML tag from input string"

Re: How to not tokenize HTML tag from input string

2007-02-08 Thread Yonik Seeley

On 2/8/07, Peter W. <[EMAIL PROTECTED]> wrote: Using a parser to get text out of HTML, XML (including RSS, ATOM) is only easy if you control the source documents. HTML pages in the wild are much different, generating exceptions you must catch and deal with. Yes, that's why the Solr version isn

Re: How to not tokenize HTML tag from input string

2007-02-08 Thread Peter W.

Hello, Using a parser to get text out of HTML, XML (including RSS, ATOM) is only easy if you control the source documents. HTML pages in the wild are much different, generating exceptions you must catch and deal with. For most projects you can probably use java.util.regex to obtain keywo
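For the java.util.regex route mentioned above, here is a minimal sketch of stripping tags before the text reaches an analyzer. The class name TagStripper and the method stripTags are made up for illustration and are not part of Lucene or Solr. It assumes reasonably well-formed markup: a regex like this will not cope with a literal ">" inside an attribute value, with comments or script bodies, or with character entities, which is the "HTML in the wild" problem described in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Lucene or Solr class.
public class TagStripper {
    // Anything from '<' up to the next '>' is treated as a tag (an assumption
    // that breaks on malformed markup or '>' inside attribute values).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not run together.
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```

The cleaned string can then be handed to whatever analyzer you already use, so the tags never become tokens. Note that offsets in the cleaned string no longer line up with the original HTML, which matters if you highlight against the original document.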

Re: How to not tokenize HTML tag from input string

2007-02-08 Thread Chris Hostetter

Solr has an HTMLStripReader used by two different tokenizers for doing the basics of ignoring tags when reading text ... it has one known bug when dealing with highlighting... http://lucene.apache.org/solr/api/org/apache/solr/analysis/HTMLStripReader.html http://lucene.apache.org/solr/api/org/

Re: How to not tokenize HTML tag from input string

2007-02-07 Thread Erick Erickson

Sure, just don't index the HTML tags in the first place. Of course that means you need to parse the document first. Here's a parser that was mentioned on the thread a while ago: http://sourceforge.net/projects/mozillaparser There may very well be others. Depending on how sophisticated you

4 matches
