Re: How to not tokenize HTML tag from input string

Erick Erickson Wed, 07 Feb 2007 18:19:01 -0800

Sure, just don't index the HTML tags in the first place. Of course, that
means you need to parse the document first. Here's a parser that was
mentioned on the thread a while ago:


http://sourceforge.net/projects/mozillaparser

There may very well be others....
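
As one illustration of the parse-first approach, here is a minimal sketch
that uses the HTML parser bundled with the JDK (javax.swing.text.html)
rather than the Mozilla parser linked above. The class and method names
(HtmlTextExtractor, extractWords) are made up for this example, and
splitting on whitespace is an assumption about what counts as a keyword.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene: collects only the text nodes
// of an HTML document so that tags and attributes never reach the indexer.
final class HtmlTextExtractor {

    static List<String> extractWords(String html) {
        final List<String> words = new ArrayList<>();

        // The parser calls handleText once per run of character data;
        // tag names and attributes arrive through other callbacks,
        // which are simply left unimplemented here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                for (String word : new String(data).trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declaration,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory String.
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

The returned words (or the text joined back into one string) can then be
handed to whatever analyzer you already use for indexing.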

Depending on how sophisticated you need to be, you might be able to use a
regular expression to remove all the HTML tags.
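
A rough sketch of the regular-expression route, using only java.util.regex.
The names (HtmlTagStripper, stripTags, keywords) are hypothetical, and the
pattern is deliberately naive: it treats anything between '<' and the next
'>' as a tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class HtmlTagStripper {

    // '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces every tag with a space so that words on either side
    // of a tag are not glued together.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    // Splits the stripped text on whitespace and drops empty pieces.
    static List<String> keywords(String html) {
        List<String> result = new ArrayList<>();
        for (String word : WHITESPACE.split(stripTags(html).trim())) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Calling HtmlTagStripper.keywords(...) on the sample document quoted below
should give You, are, welcome, Testing, text. This will not cope with
comments, script or style blocks, a '>' inside an attribute value, or
entities such as &amp;, which is where a real parser earns its keep.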

Best
Erick

On 2/7/07, Joe Tang <[EMAIL PROTECTED]> wrote:



My task is to index the keywords in a document. In my case, the document
contains HTML tags, which I don't want to index.

For example:
Input Document:
<div id="tp-wrapper">
<span id="tp-top-right">You are welcome</span>
<div id="tp-tab">
<h1>Testing text</h1>
</div>
</div>

Expected Keywords:
keywords:You
keywords:are
keywords:welcome
keywords:Testing
keywords:text

Is there any way I can keep the tags from becoming keywords?
--
View this message in context:
http://www.nabble.com/How-to-not-tokenize-HTML-tag-from-input-string-tf3190778.html#a8857789
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

