Aha! I had, indeed, been fooled by Luke into thinking that the entities had been converted upon analysis, but you have set me straight.
Thanks,
-- Robert

On Tue, 13 Dec 2005, J.J. Larrea wrote:
Beware of HTML/XML entities in your input stream! The Lucene analyzers (including StandardAnalyzer) do not interpret these representation-specific encodings, and treat the & and ; delimiters as punctuation. How they deal with punctuation depends on the specific Analyzer logic.

[ snipped ]

PS: Also note that when using Luke to see what is indexed, it uses NCRs (numeric character references, e.g. &#233; for é) to display non-ASCII characters, so it is easy to be confused as to whether the NCRs or the Unicode characters themselves were indexed.
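
To make the point above concrete, here is a minimal sketch in Python rather than Lucene's Java, so it does not depend on any Lucene API. The function names `decode_entities` and `naive_tokens` are hypothetical, and `naive_tokens` is only a crude stand-in for an analyzer that treats & and ; as punctuation. It is not how StandardAnalyzer actually tokenizes. The idea is simply to decode entities before the text reaches the analyzer.

```python
import html
import re


def decode_entities(text: str) -> str:
    """Turn HTML/XML entities and NCRs into the Unicode characters they stand for."""
    return html.unescape(text)


def naive_tokens(text: str) -> list[str]:
    """Hypothetical stand-in for an analyzer: lowercase, then split on anything
    that is not a word character, so & # ; all act as token separators."""
    return re.findall(r"\w+", text.lower())


raw = "caf&#233; cr&egrave;me"

# Without decoding, the entity pieces become tokens of their own.
print(naive_tokens(raw))

# With decoding first, the intended words survive intact.
print(naive_tokens(decode_entities(raw)))
```

In a real indexing pipeline the same decoding step would need to happen before the field text is handed to the Analyzer, and the query side would need matching treatment.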