Re: html parsers and numbers of terms

Robert Watkins Tue, 13 Dec 2005 08:35:52 -0800

Aha! I had, indeed, been fooled by Luke into thinking that the entities
had been converted upon analysis, but you have set me straight.


Thanks,
-- Robert

On Tue, 13 Dec 2005, J.J. Larrea wrote:

Beware of HTML/XML entities in your input stream!  The Lucene analyzers (including
StandardAnalyzer) do not interpret these representation-specific encodings; they
treat the & and ; delimiters as punctuation.  How they handle that punctuation
depends on the specific Analyzer logic.
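
A minimal sketch of the effect described above, written in Python rather than
Java and without Lucene. The function naive_tokens is a hypothetical stand-in
that simply splits on non-word characters; it is not StandardAnalyzer's actual
logic. It only illustrates why decoding entities before analysis changes the
terms that come out.

```python
import html
import re

def naive_tokens(text):
    # Hypothetical stand-in for an analyzer: keep runs of word characters
    # and treat everything else (including & # ;) as punctuation.
    return re.findall(r"\w+", text.lower())

raw = "Caf&#233; cr&egrave;me"

# Entities left in place: the delimiters split each word into fragments.
print(naive_tokens(raw))

# Entities decoded first: the tokenizer sees real Unicode characters.
print(naive_tokens(html.unescape(raw)))
```

With the entities left in place the fragments include "caf", "233" and
"egrave"; after html.unescape the terms are the whole words "café" and "crème".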

[ snipped ]

PS: Also note that when you use Luke to see what is indexed, it displays non-ASCII
characters as NCRs (e.g. &#233;), so it is easy to be confused as to whether the
NCRs or the Unicode characters were actually indexed.
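
To make the ambiguity concrete, here is a small Python sketch. The helper
ncr_display is hypothetical and only assumes a viewer that renders non-ASCII
characters as decimal NCRs; it is not Luke's code. Two different indexed terms
end up looking identical on screen.

```python
def ncr_display(term):
    # Render each non-ASCII character as a decimal NCR such as &#233;
    return term.encode("ascii", "xmlcharrefreplace").decode("ascii")

unicode_term = "caf\u00e9"   # the single character é was indexed
literal_term = "caf&#233;"   # the six ASCII characters & # 2 3 3 ; were indexed

# Both terms display the same way, although they are different strings.
print(ncr_display(unicode_term))
print(ncr_display(literal_term))
print(unicode_term == literal_term)
```

Comparing the raw term strings (or their lengths) rather than the displayed
form is one way to tell the two cases apart.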

