ISO-8859-1 guarantees round-trip conversion between bytes and chars, so no data is lost and you avoid apparently impossible situations where the JDK gives you a list of files in a directory, but you get "File not found" when you try to open them.
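A minimal sketch of that round-trip property, using only the standard `String` and `Charset` APIs. The class name `RoundTrip`, the helper `roundTrip`, and the sample bytes are made up for illustration; the sample stands in for a filename that was written in Latin-1 and is therefore not valid UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    // Decode bytes to a String with the given charset, then encode back to bytes.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Hypothetical filename bytes: "caf" followed by 0xE9 (e-acute in Latin-1).
        // The lone 0xE9 byte is malformed when interpreted as UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        byte[] viaLatin1 = roundTrip(raw, StandardCharsets.ISO_8859_1);
        byte[] viaUtf8 = roundTrip(raw, StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte value to exactly one char, so nothing is lost.
        System.out.println("ISO-8859-1 preserved: " + Arrays.equals(raw, viaLatin1)); // true
        // The UTF-8 decoder substitutes U+FFFD for the bad byte, so the bytes change.
        System.out.println("UTF-8 preserved:      " + Arrays.equals(raw, viaUtf8));   // false
    }
}
```

The second result is the same kind of mismatch that produces a "File not found": the bytes handed back to the OS are no longer the bytes that were read.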
If you want to show the file names to users, you can always take your
ISO-8859-1 decoded strings, turn them back into byte[], and decode using
UTF-8 later, if you so desire.

(The basic OS interfaces in the JDK are not so flexible. They are
hard-coded to use the one charset specified by file.encoding.)

Martin

On Wed, Sep 10, 2008 at 14:54, Naoto Sato <[EMAIL PROTECTED]> wrote:
> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd
> rather choose UTF-8, as the default encoding on recent Unix/Linux are all
> UTF-8 so the filenames are likely in UTF-8.
>
> Naoto
>
> Martin Buchholz wrote:
>>
>> Java made the decision to use String as an abstraction
>> for many OS-specific objects, like filenames (or environment variables).
>> Most of the time this works fine, but occasionally you can notice
>> that the underlying OS (in the case of Unix) actually uses
>> arbitrary byte arrays as filenames.
>>
>> It would have been much more confusing to provide an interface
>> to filenames that is sometimes a sequence of char, sometimes a
>> sequence of byte.
>>
>> So this is unlikely to change.
>>
>> But if all you want is reliable reversible conversion,
>> using java -Dfile.encoding=ISO-8859-1
>> should do the trick.
>>
>> Martin
>>
>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Sorry if this is the wrong list for this question. I tried asking it
>>> on comp.lang.java, but didn't get very far there.
>>>
>>> I've been wanting to expand my horizons a bit by taking one of my
>>> programs and rewriting it into a number of other languages. It
>>> started life in python, and I've recoded it into perl
>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>> Next on my list is java. After that I'll probably do Haskell and
>>> Eiffel/Sather.
>>>
>>> So the python and perl versions were pretty easy, but I'm finding that
>>> the java version has a somewhat solution-resistant problem with
>>> non-ASCII filenames.
>>>
>>> The program just reads filenames from stdin (usually generated with
>>> the *ix find command), and then compares those files, dividing them up
>>> into equal groups.
>>>
>>> The problem with the java version, which manifests both with OpenJDK
>>> and gcj, is that the filenames being read from disk are 8 bit, and the
>>> filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
>>> but as far as the java language is concerned, those filenames are made
>>> up of 16 bit characters. That's fine, but going from 8 to 16 bit and
>>> back to 8 bit seems to be non-information-preserving in this case,
>>> which isn't so fine - I can clearly see the program, in an strace,
>>> reading with one sequence of bytes, but then trying to open
>>> another-though-related sequence of bytes. To be perfectly clear: It's
>>> getting file not found errors.
>>>
>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
>>> program to handle files with one encoding, but not another. I've
>>> tried a bunch of values in these variables, including ISO-8859-1, C,
>>> POSIX, UTF-8, and so on.
>>>
>>> Is there such a thing as a filename encoding that will map 8 bit
>>> filenames to 16 bit characters, but only using the low 8 bits of those
>>> 16, and then map back to 8 bit filenames only using those low 8 bits
>>> again?
>>>
>>> Is there some other way of making a Java program on Linux able to read
>>> filenames from stdin and later open those filenames?
>>>
>>> Thanks!
>>>
>
>
> --
> Naoto Sato
>
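
The re-decoding step suggested at the top of this thread (keep names as ISO-8859-1 strings for opening files, and reinterpret the same bytes as UTF-8 only for display) might look roughly like the sketch below. The class name `DisplayName` and the helper `forDisplay` are hypothetical, and the sketch assumes the on-disk names really are UTF-8, which nothing guarantees on Unix.

```java
import java.nio.charset.StandardCharsets;

public class DisplayName {
    // Recover the original filename bytes from an ISO-8859-1 decoded String,
    // then reinterpret those bytes as UTF-8 for showing to a user.
    // Only meaningful for Strings that were produced by ISO-8859-1 decoding:
    // any char above U+00FF would be replaced with '?' by getBytes.
    static String forDisplay(String latin1Name) {
        byte[] raw = latin1Name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two CJK characters, written as escapes to keep this source file ASCII-only.
        String original = "\u65e5\u672c";

        // Simulate what a JVM running with file.encoding=ISO-8859-1 would see
        // for a filename stored on disk as UTF-8 bytes.
        byte[] onDisk = original.getBytes(StandardCharsets.UTF_8);
        String asSeenByJvm = new String(onDisk, StandardCharsets.ISO_8859_1);

        // One char per byte: this is the form that can safely be used to reopen the file.
        System.out.println("chars seen by JVM: " + asSeenByJvm.length());
        // The display form matches the original text again.
        System.out.println("display recovered: " + forDisplay(asSeenByJvm).equals(original));
    }
}
```

The ISO-8859-1 form stays the key for opening files; the UTF-8 reinterpretation is for presentation only and may contain replacement characters when a name is not valid UTF-8.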