: I'm having some problems indexing my UTF-8 html pages. I am running : lucene on Linux and I cannot understand why does the index generated : depends on the locale of my operating system. : If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set this : to en_US the index generated will be different. Why is this the case? My : HTMLs are all UTF-8.
To elaborate a little bit more on the comments other people have made, the differences you are seeing are most likely related to your JVM using the LANG variable to determine what the default charset will be when you open readers. You should look carefully at how you are opening the HTML files and reading them in. If you raen't specifying the Charset explicitly in your code, then you're getting the system default. -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]