Fred Toth wrote:
I'm thinking we need something like an "HTMLTokenizer" that bridges the gap between StandardAnalyzer and an external HTML parser. Since so many of us are dealing with HTML, I would think this would be generally useful. It could work this way:
Given this input:
<html><head><title>Howdy there</title></head><body>Hello world</body></html>
An HTMLTokenizer would deliver a token stream something like this (the numbers are the start/end offsets of each token in the source):
TAG, <html>, 0, 6
TAG, <head>, 6, 12
TAG, <title>, 12, 19
WORD, Howdy, 19, 24
WORD, there, 25, 30
TAG, </title>, 30, 38
etc.
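
To make that concrete, here is a rough sketch of how such a tokenizer might be written against the Token-returning next() style of TokenStream. The class name, the buffer-the-whole-input approach, and the naive "<...>" scan are just placeholders for illustration; a real implementation would hand the markup to a proper HTML parser.

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class HTMLTokenizer extends Tokenizer {
  private String text;   // whole source buffered up front, for simplicity
  private int pos = 0;   // current offset into the source

  public HTMLTokenizer(Reader reader) throws IOException {
    super(reader);
    StringBuffer sb = new StringBuffer();
    char[] buf = new char[1024];
    int n;
    while ((n = reader.read(buf)) != -1)
      sb.append(buf, 0, n);
    text = sb.toString();
  }

  public Token next() throws IOException {
    // skip whitespace and punctuation between tokens
    while (pos < text.length()
           && text.charAt(pos) != '<'
           && !Character.isLetterOrDigit(text.charAt(pos)))
      pos++;
    if (pos >= text.length())
      return null;

    int start = pos;
    if (text.charAt(pos) == '<') {              // TAG: everything up to and including '>'
      while (pos < text.length() && text.charAt(pos) != '>')
        pos++;
      if (pos < text.length())
        pos++;                                  // include the '>'
      return new Token(text.substring(start, pos), start, pos, "TAG");
    }

    // WORD: a run of letters/digits
    while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos)))
      pos++;
    return new Token(text.substring(start, pos), start, pos, "WORD");
  }
}

Running that over the input above would yield exactly the TAG/WORD stream and offsets shown.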
Given the above, a filter could then strip out the HTML but pass the WORDs on to Lucene, preserving the offsets into the source file. These would be used later during highlighting. Clever filters could be selective about what gets stripped and what gets passed on.
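
For example, a minimal tag-stripping filter could look like the sketch below (HTMLStripFilter is a made-up name, and it again assumes the Token-returning next() API). TAG tokens are consumed and discarded, so only WORDs reach the indexer, and their offsets still point into the raw HTML for highlighting.

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HTMLStripFilter extends TokenFilter {
  public HTMLStripFilter(TokenStream in) {
    super(in);
  }

  public Token next() throws IOException {
    Token t;
    while ((t = input.next()) != null) {
      if (!"TAG".equals(t.type()))   // keep WORDs (and anything else non-TAG)
        return t;
      // else: drop the tag, but its characters still count toward the offsets
    }
    return null;
  }
}

A more selective filter could, say, keep the text of <title> or alt attributes while dropping everything else.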
For what it's worth, I think that's a good design and would love to see this as a contribution.
Doug