Glad that hint was useful. I was totally bitten by that artifact myself: it turned out there were XML numeric character references within VARCHAR fields in a database I was indexing, so I never suspected that the NCRs I was seeing in Luke had anything to do with the (so I thought) non-XML, non-HTML source data.
Also take care that when fields are stored, it is quite easy to get confused between the stored values, which aren't analyzed, and the indexed tokens, which obviously are. Asking Luke to reconstruct a source document from its indexed tokens is a great way to see it from an "index-eye" view, which can be very revealing.

- J.J.

At 11:36 AM -0500 12/13/05, Robert Watkins wrote:
>Aha! I had, indeed, been fooled by Luke into thinking that the entities
>had been converted upon analysis, but you have set me straight.
>
>Thanks,
>-- Robert
>
>On Tue, 13 Dec 2005, J.J. Larrea wrote:
>
>>Beware of HTML/XML entities in your input stream! The Lucene analyzers
>>(including StandardAnalyzer) do not interpret these representation-specific
>>encodings, and assume the & and ; delimiters are punctuation. How they deal
>>with punctuation depends on the specific Analyzer logic.
>>
>>[ snipped ]
>>
>>PS: Also note that when using Luke to see what is indexed, it uses NCRs
>>(e.g. &#233;) to display non-ASCII characters, allowing one to be easily
>>confused as to whether the NCRs or the Unicode characters were indexed.
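For anyone who would rather verify this directly than infer it from Luke's display, below is a minimal sketch that dumps the tokens StandardAnalyzer produces for input still containing character references. It is written against the older TokenStream.next()/Token.termText() style of iteration (Lucene 1.x/2.x era; later releases use attribute-based iteration), and the field name "body" and the sample text are just placeholders. The exact split depends on the StandardTokenizer grammar, but the references come through as plain text fragments around the & and ; punctuation, not as decoded Unicode characters.

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class EntityTokenDump {
    public static void main(String[] args) throws Exception {
        // Text as it might arrive from a VARCHAR column, with an XML NCR
        // and named entities still embedded in it.
        String text = "caf&#233; au lait &amp; cr&egrave;me";

        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("body", new StringReader(text));

        // Old-style iteration: next() returns each Token until the stream is exhausted.
        Token tok;
        while ((tok = stream.next()) != null) {
            System.out.println(tok.termText() + " [" + tok.type() + "]");
        }
    }
}

Whatever tokens fall out of the grammar, none of them will be the single character the reference stands for, so if you want the decoded characters indexed you need to resolve the references before the text ever reaches the Analyzer.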