Sure, just don't index the HTML tags in the first place. Of course, that means you need to parse the document first. Here's a parser that was mentioned on the thread a while ago:

http://sourceforge.net/projects/mozillaparser

There may very well be others. Depending on how sophisticated you need to be, you might be able to use a regular expression to remove all the HTML tags.

Best
Erick

On 2/7/07, Joe Tang <[EMAIL PROTECTED]> wrote:
My task is to index keywords from a document. In my case, the document contains HTML tags, which I don't want to index. For example:

Input Document:
<div id="tp-wrapper">
<span id="tp-top-right">You are welcome</span>
<div id="tp-tab">
<h1>Testing text</h1>
</div>
</div>

Expected Keywords:
keywords:You
keywords:are
keywords:welcome
keywords:Testing
keywords:text

Is there any way I can keep the tags from becoming keywords?

--
View this message in context: http://www.nabble.com/How-to-not-tokenize-HTML-tag-from-input-string-tf3190778.html#a8857789
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
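
For what it's worth, the regular-expression idea from the reply at the top of this thread could look roughly like the sketch below. It is plain Java with no Lucene calls. The class name HtmlTagStripper and its methods are made up for illustration, and it assumes every tag is a simple <...> run with no '>' inside attribute values. It does not handle comments, script blocks, or character entities, so a real HTML parser is the safer choice for messy input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: strips tags before the text is indexed.
class HtmlTagStripper {
    // Naive assumption: anything from '<' up to the next '>' is a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replace each tag with a space so words on either side do not run together.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    // Split the remaining text on whitespace to get the candidate keywords.
    static List<String> keywords(String html) {
        List<String> result = new ArrayList<>();
        for (String token : stripTags(html).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "<div id=\"tp-wrapper\"> <span id=\"tp-top-right\">You are welcome</span> "
                + "<div id=\"tp-tab\"> <h1>Testing text</h1> </div> </div>";
        for (String keyword : keywords(input)) {
            System.out.println("keywords:" + keyword);
        }
    }
}
```

The string returned by stripTags is what you would then hand to your analyzer when adding the field, so the tags never reach the tokenizer.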