Glad that hint was useful. I was totally bitten by that artifact myself: it turned out there were XML numeric character references within VARCHAR fields in a database I was indexing, so I never suspected that the NCRs I was seeing in Luke had anything to do with the (so I thought) non-XML, non-HTML source data.
Also take care that when fields are stored, it is quite easy to get confused between the stored values, which aren't analyzed, and the indexed tokens, which obviously are. Asking Luke to reconstruct a source document from its indexed tokens is a great way to see it from an "index-eye" view, which can be very revealing.

- J.J.

At 11:36 AM -0500 12/13/05, Robert Watkins wrote:
>Aha! I had, indeed, been fooled by Luke into thinking that the entities
>had been converted upon analysis, but you have set me straight.
>
>Thanks,
>-- Robert
>
>On Tue, 13 Dec 2005, J.J. Larrea wrote:
>
>>Beware of HTML/XML entities in your input stream! The Lucene analyzers
>>(including StandardAnalyzer) do not interpret these representation-specific
>>encodings, and assume the & and ; delimiters are punctuation. How they deal
>>with punctuation depends on the specific Analyzer logic.
>>
>>[ snipped ]
>>
>>PS: Also note that when using Luke to see what is indexed, it uses NCRs
>>(e.g. &#233;) to display non-ASCII characters, allowing one to be easily
>>confused as to whether the NCRs or the Unicode characters were indexed.
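For anyone who would rather verify this directly than infer it from Luke's display, below is a minimal sketch that dumps the tokens StandardAnalyzer produces for input still containing character references. It is written against the older TokenStream.next()/Token.termText() style of iteration (Lucene 1.x/2.x era; later releases use attribute-based iteration), and the field name "body" and the sample text are just placeholders. The exact split depends on the StandardTokenizer grammar, but the references come through as plain text fragments around the & and ; punctuation, not as decoded Unicode characters.

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class EntityTokenDump {
    public static void main(String[] args) throws Exception {
        // Text as it might arrive from a VARCHAR column, with an XML NCR
        // and named entities still embedded in it.
        String text = "caf&#233; au lait &amp; cr&egrave;me";

        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.tokenStream("body", new StringReader(text));

        // Old-style iteration: next() returns each Token until the stream is exhausted.
        Token tok;
        while ((tok = stream.next()) != null) {
            System.out.println(tok.termText() + " [" + tok.type() + "]");
        }
    }
}

Whatever tokens fall out of the grammar, none of them will be the single character the reference stands for, so if you want the decoded characters indexed you need to resolve the references before the text ever reaches the Analyzer.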