Fred Toth wrote:
I'm thinking we need something like an "HTMLTokenizer" that bridges the gap between StandardAnalyzer and an external HTML parser. Since so many of us are dealing with HTML, I would think this would be generally useful. It could work this way:
Given this input:
<html><head><title>Howdy there</title></head><body>Hello world</body></html>
An HTMLTokenizer would deliver a token stream something like this (the numbers are the start/end offsets of each token in the source):
TAG, <html>, 0, 6
TAG, <head>, 6, 12
TAG, <title>, 12, 19
WORD, Howdy, 19, 24
WORD, there, 25, 30
TAG, </title>, 30, 38
etc.
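
To make that concrete, here is a rough sketch of how such a tokenizer might be written against the Token-returning next() style of TokenStream. The class name, the buffer-the-whole-input approach, and the naive "<...>" scan are just placeholders for illustration; a real implementation would hand the markup to a proper HTML parser.

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class HTMLTokenizer extends Tokenizer {
  private String text;   // whole source buffered up front, for simplicity
  private int pos = 0;   // current offset into the source

  public HTMLTokenizer(Reader reader) throws IOException {
    super(reader);
    StringBuffer sb = new StringBuffer();
    char[] buf = new char[1024];
    int n;
    while ((n = reader.read(buf)) != -1)
      sb.append(buf, 0, n);
    text = sb.toString();
  }

  public Token next() throws IOException {
    // skip whitespace and punctuation between tokens
    while (pos < text.length()
           && text.charAt(pos) != '<'
           && !Character.isLetterOrDigit(text.charAt(pos)))
      pos++;
    if (pos >= text.length())
      return null;

    int start = pos;
    if (text.charAt(pos) == '<') {              // TAG: everything up to and including '>'
      while (pos < text.length() && text.charAt(pos) != '>')
        pos++;
      if (pos < text.length())
        pos++;                                  // include the '>'
      return new Token(text.substring(start, pos), start, pos, "TAG");
    }

    // WORD: a run of letters/digits
    while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos)))
      pos++;
    return new Token(text.substring(start, pos), start, pos, "WORD");
  }
}

Running that over the input above would yield exactly the TAG/WORD stream and offsets shown.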
Given the above, a filter could then strip out the HTML but pass the WORDs on to Lucene, preserving the offsets into the source file. These would be used later during highlighting. Clever filters could be selective about what gets stripped and what gets passed on.
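
For example, a minimal tag-stripping filter could look like the sketch below (HTMLStripFilter is a made-up name, and it again assumes the Token-returning next() API). TAG tokens are consumed and discarded, so only WORDs reach the indexer, and their offsets still point into the raw HTML for highlighting.

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HTMLStripFilter extends TokenFilter {
  public HTMLStripFilter(TokenStream in) {
    super(in);
  }

  public Token next() throws IOException {
    Token t;
    while ((t = input.next()) != null) {
      if (!"TAG".equals(t.type()))   // keep WORDs (and anything else non-TAG)
        return t;
      // else: drop the tag, but its characters still count toward the offsets
    }
    return null;
  }
}

A more selective filter could, say, keep the text of <title> or alt attributes while dropping everything else.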
For what it's worth, I think that's a good design and would love to see this as a contribution.
Doug