Aha! I had, indeed, been fooled by Luke into thinking that the entities
had been converted upon analysis, but you have set me straight.

Thanks,
-- Robert

On Tue, 13 Dec 2005, J.J. Larrea wrote:

Beware of HTML/XML entities in your input stream!  The Lucene analyzers (including 
StandardAnalyzer) do not interpret these representation-specific encodings, and 
assume the & and ; delimiters are punctuation.  How they deal with punctuation 
depends on the specific Analyzer logic.
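
A minimal sketch of the effect described above, in Python rather than Java and using only the standard library. `naive_tokens` is a hypothetical stand-in for an analyzer that treats punctuation as a token break; it is not Lucene code, and real analyzers may split differently. The point is only that entities should be decoded before the text reaches the tokenizer.

```python
import html
import re


def decode_entities(text):
    """Turn HTML/XML entities and NCRs into the Unicode characters they denote."""
    return html.unescape(text)


def naive_tokens(text):
    """Hypothetical analyzer stand-in: lowercase runs of letters, digits, underscores."""
    return [t.lower() for t in re.findall(r"\w+", text)]


raw = "caf&#233; &amp; bar"

# Without decoding, & and ; act as punctuation and the entity is broken apart.
print(naive_tokens(raw))

# Decoding first yields the terms a reader would expect.
print(naive_tokens(decode_entities(raw)))
```

In the undecoded case the entity bodies ("233", "amp") end up as separate tokens, so a search for the accented word would not match.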

[ snipped ]

PS: Also note that when using Luke to see what is indexed, it uses NCRs, e.g. 
&#233; for é, to display non-ASCII characters, so one can easily be confused as 
to whether the NCRs or the Unicode characters were indexed.
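
The display ambiguity can be illustrated with a short Python sketch. `display_with_ncrs` is a hypothetical helper that escapes non-ASCII characters as decimal NCRs for display; it is an assumption for illustration, not Luke's actual code.

```python
def display_with_ncrs(term):
    """Hypothetical display helper: show non-ASCII characters as decimal NCRs."""
    return term.encode("ascii", "xmlcharrefreplace").decode("ascii")


indexed_unicode = "caf\u00e9"   # the Unicode character itself was indexed
indexed_ncr = "caf&#233;"       # the literal NCR text was indexed

# Both terms look identical once escaped for display.
print(display_with_ncrs(indexed_unicode))
print(display_with_ncrs(indexed_ncr))

# Inspecting the raw term tells them apart: 4 characters versus 9.
print(len(indexed_unicode), len(indexed_ncr))
```

Checking the raw term length, or whether it contains a literal "&#", is one way to tell which form actually went into the index.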

