Fred Toth wrote:
I'm thinking we need something like an "HTMLTokenizer" that bridges the
gap between StandardAnalyzer and an external HTML parser. Since so
many of us are dealing with HTML, this would be generally useful for a
lot of problems. It could work this way:

Given this input:

<html><head><title>Howdy there</title></head><body>Hello world</body></html>

An HTMLTokenizer would deliver a token stream something like this
(the numbers are the start/end offsets of each token in the source):

TAG, <html>, 0, 6
TAG, <head>, 6, 12
TAG, <title>, 12, 19
WORD, Howdy, 19, 24
WORD, there, 25, 30
TAG, </title>, 30, 38
etc.
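
To make that concrete, here is a rough, untested sketch of such a
tokenizer, assuming the TokenStream API where next() returns a Token
carrying a type string and offsets. The class name and the TAG/WORD
type strings are just placeholders:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

// Emits TAG tokens for <...> spans and WORD tokens for runs of letters
// and digits, recording the start/end offsets of each in the source.
public class HTMLTokenizer extends Tokenizer {

    private int offset = 0;        // how many chars have been consumed
    private int pushedBack = -2;   // one-character pushback (-2 = empty)

    public HTMLTokenizer(Reader input) {
        super(input);
    }

    private int read() throws IOException {
        int c;
        if (pushedBack != -2) {
            c = pushedBack;
            pushedBack = -2;
        } else {
            c = input.read();
        }
        if (c != -1) offset++;
        return c;
    }

    private void unread(int c) {
        pushedBack = c;
        if (c != -1) offset--;
    }

    public Token next() throws IOException {
        int c = read();
        // skip whitespace and punctuation between tokens
        while (c != -1 && c != '<' && !Character.isLetterOrDigit((char) c)) {
            c = read();
        }
        if (c == -1) return null;

        int start = offset - 1;
        StringBuffer buf = new StringBuffer();

        if (c == '<') {
            // TAG: everything up to and including the closing '>'
            buf.append((char) c);
            while (c != -1 && c != '>') {
                c = read();
                if (c != -1) buf.append((char) c);
            }
            return new Token(buf.toString(), start, offset, "TAG");
        }

        // WORD: a run of letters and digits
        while (c != -1 && Character.isLetterOrDigit((char) c)) {
            buf.append((char) c);
            c = read();
        }
        unread(c);
        return new Token(buf.toString(), start, offset, "WORD");
    }
}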

Given the above, a filter could then strip out the HTML tags but pass
the WORD tokens on to Lucene, preserving their offsets in the source
file; these would be used later during highlighting. Clever filters
could be selective about what gets stripped and what gets passed on.
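
A minimal version of such a stripping filter might look like this
(again untested, same assumed next()-returns-Token API; the class
name is only a placeholder):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Drops TAG tokens and passes everything else (the WORDs) through
// unchanged, so the offsets recorded by the tokenizer survive.
public class HTMLTagStripFilter extends TokenFilter {

    public HTMLTagStripFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        // Skip TAG tokens; a cleverer filter could decide per tag.
        while (t != null && "TAG".equals(t.type())) {
            t = input.next();
        }
        return t;
    }
}

An analyzer would then just chain them, e.g.
new HTMLTagStripFilter(new HTMLTokenizer(reader)), possibly with a
LowerCaseFilter and StopFilter on top as usual.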

For what it's worth, I think that's a good design and would love to see this as a contribution.

Doug
