John Cherouvim wrote:
Hello
I'm having some problems indexing my UTF-8 HTML pages. I am running
Lucene on Linux and I cannot understand why the generated index
depends on the locale of my operating system.
If I run set | grep LANG I get LANG=el_GR, which is Greek. If I set this
to en_US, the generated index is different. Why is this the case? My
HTML files are all UTF-8.
I think the difference comes from the default character encoding. If the
page is NOT clearly marked as UTF-8, the system has to guess, and
it guesses differently depending on the current locale.
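
One way to take the guessing out of the picture is to name the charset
yourself when you turn the bytes of the page into characters, before
the text reaches the analyzer. I don't know how your indexing code reads
the files, so the following is only a rough sketch: the class and method
names (ExplicitCharsetDemo, decode) are made up for illustration, and
ISO-8859-1 merely stands in for whatever a non-UTF-8 locale default
might be. It uses only the standard Java I/O classes, nothing from
Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
public class ExplicitCharsetDemo {

    // Decode a byte stream with an explicitly named charset, so the
    // result does not depend on the platform default encoding.
    static String decode(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The Greek word for "Greek", written with Unicode escapes so the
        // example does not depend on the encoding of this source file.
        String original = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Same bytes, two different decodings.
        String right = decode(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = decode(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches:      " + original.equals(right));
        System.out.println("decoded as ISO-8859-1 matches: " + original.equals(wrong));
    }
}
```

If the reader is created without a charset argument, the JVM falls back
to its default encoding, which on many setups follows the locale. That
would explain why the same files produce different terms in the index
under el_GR and en_US.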
Also, is there a Lucene index browser? I am currently using Luke, which
is good, but it doesn't display the Greek UTF-8 text from the index
correctly. Is this a matter of a setting in Luke?
It's a matter of setting the appropriate font in Settings.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com