Since the thread seems to be trailing off... Does anyone know of any mailing lists that might be more appropriate for this question?
Also, is there another OS I should try (perhaps in a little QEMU) for a point of comparison? Preferably something that also uses 8-bit filenames, but that would have very different localization data and code other than the Java runtimes themselves? Does FreeBSD fit this description?

On Wed, Sep 10, 2008 at 5:52 PM, Dan Stromberg <[EMAIL PROTECTED]> wrote:
>
> Would you believe that I'm getting file not found errors even with
> ISO-8859-1?
>
> (Naoto: My program doesn't know what encoding to expect - I'm afraid I
> probably have different applications writing filenames in different
> encodings on my Ubuntu system. I'd been thinking I wanted to treat
> filenames as just a sequence of bytes, and let the terminal emulator
> interpret the encoding (hopefully) correctly on output.)
>
> This gives two file not found tracebacks:
>
> export LC_ALL='ISO-8859-1'
> export LC_CTYPE="$LC_ALL"
> export LANG="$LC_ALL"
>
> find 'test-files' -type f -print | java -Xmx512M \
>     -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>
> find ~/Sound/Music -type f -print | java -Xmx512M \
>     -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
>
> I'm reading the filenames like this:
>
> try {
>     while((line = stdin.readLine()) != null) {
>         // System.out.println(line);
>         // System.out.flush();
>         lst.add(new Sortable_file(line));
>     }
> }
> catch(java.io.IOException e)
> {
>     System.err.println("**** exception " + e);
>     e.printStackTrace();
> }
>
> Where Sortable_file's constructor just looks like:
>
> public Sortable_file(String filename)
> {
>     this.filename = filename;
>     /*
>     Java doesn't have a stat function without doing some fancy stuff,
>     so we skip this optimization. It really only helps with hard links
>     anyway.
>     this.device = -1
>     this.inode = -1
>     */
>     File file = new File(this.filename);
>     this.size = file.length();
>     // It bothers me a little that we can't close this, but perhaps it's
>     // unnecessary. That'll be determined in large tests.
>     // file.close();
>     this.have_prefix = false;
>     this.have_hash = false;
> }
>
> ...and the part that actually blows up looks like:
>
> private void get_prefix()
> {
>     byte[] buffer = new byte[128];
>     try
>     {
>         // The next line is the one that gives file not found
>         FileInputStream file = new FileInputStream(this.filename);
>         file.read(buffer);
>         // System.out.println("this.prefix.length " + this.prefix.length);
>         file.close();
>     }
>     catch (IOException ioe)
>     {
>         // System.out.println( "IO error: " + ioe );
>         ioe.printStackTrace();
>         System.exit(1);
>     }
>     this.prefix = new String(buffer);
>     this.have_prefix = true;
> }
>
> Interestingly, it has already gotten the file's length without an error
> by the time it goes to read data from the file and runs into trouble.
>
> I don't -think- I'm doing anything screwy in there - could it be that
> ISO-8859-1 isn't giving good round-trip conversions in practice? Would
> this be an attribute of the Java runtime in question, or could it be a
> matter of the locale files on my Ubuntu system being a little off? It
> would seem the locale files would be the better explanation (or a bug in
> my program I'm not seeing!), since I get the same errors with both
> OpenJDK and gcj.
>
> Martin Buchholz wrote:
>>
>> ISO-8859-1 guarantees round-trip conversion between bytes and chars,
>> guaranteeing no loss of data and avoiding apparently impossible
>> situations where the JDK gives you a list of files in a directory, but
>> you get File not found when you try to open them.
>>
>> If you want to show the file names to users, you can always take
>> your ISO-8859-1 decoded strings, turn them back into byte[],
>> and decode using UTF-8 later, if you so desire.
>> (The basic OS interfaces in the JDK are not so flexible.
>> They are hard-coded to use the one charset specified by file.encoding.)
>>
>> Martin
>>
>> On Wed, Sep 10, 2008 at 14:54, Naoto Sato <[EMAIL PROTECTED]> wrote:
>>>
>>> Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd
>>> rather choose UTF-8, as the default encoding on recent Unix/Linux
>>> systems is UTF-8, so the filenames are likely in UTF-8.
>>>
>>> Naoto
>>>
>>> Martin Buchholz wrote:
>>>>
>>>> Java made the decision to use String as an abstraction
>>>> for many OS-specific objects, like filenames (or environment
>>>> variables). Most of the time this works fine, but occasionally you
>>>> can notice that the underlying OS (in the case of Unix) actually
>>>> uses arbitrary byte arrays as filenames.
>>>>
>>>> It would have been much more confusing to provide an interface
>>>> to filenames that is sometimes a sequence of char and sometimes a
>>>> sequence of byte.
>>>>
>>>> So this is unlikely to change.
>>>>
>>>> But if all you want is reliable reversible conversion,
>>>> using java -Dfile.encoding=ISO-8859-1
>>>> should do the trick.
>>>>
>>>> Martin
>>>>
>>>> On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <[EMAIL PROTECTED]>
>>>> wrote:
>>>>>
>>>>> Sorry if this is the wrong list for this question. I tried asking
>>>>> it on comp.lang.java, but didn't get very far there.
>>>>>
>>>>> I've been wanting to expand my horizons a bit by taking one of my
>>>>> programs and rewriting it into a number of other languages. It
>>>>> started life in Python, and I've recoded it into Perl
>>>>> (http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
>>>>> Next on my list is Java. After that I'll probably do Haskell and
>>>>> Eiffel/Sather.
>>>>>
>>>>> The Python and Perl versions were pretty easy, but I'm finding that
>>>>> the Java version has a somewhat solution-resistant problem with
>>>>> non-ASCII filenames.
>>>>>
>>>>> The program just reads filenames from stdin (usually generated with
>>>>> the *ix find command), and then compares those files, dividing them
>>>>> up into groups of equal files.
>>>>>
>>>>> The problem with the Java version, which manifests both with
>>>>> OpenJDK and gcj, is that the filenames being read from disk are
>>>>> 8-bit, and the filenames opened by the OpenJDK JVM or gcj-compiled
>>>>> binary are 8-bit, but as far as the Java language is concerned,
>>>>> those filenames are made up of 16-bit characters. That's fine, but
>>>>> going from 8 bits to 16 and back to 8 seems to be
>>>>> non-information-preserving in this case, which isn't so fine - in
>>>>> an strace, I can clearly see the program reading one sequence of
>>>>> bytes, but then trying to open another-though-related sequence of
>>>>> bytes. To be perfectly clear: it's getting file not found errors.
>>>>>
>>>>> By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get
>>>>> the program to handle files in one encoding, but not another. I've
>>>>> tried a bunch of values in these variables, including ISO-8859-1,
>>>>> C, POSIX, UTF-8, and so on.
>>>>>
>>>>> Is there such a thing as a filename encoding that will map 8-bit
>>>>> filenames to 16-bit characters using only the low 8 bits of those
>>>>> 16, and then map back to 8-bit filenames using only those low 8
>>>>> bits again?
>>>>>
>>>>> Is there some other way of making a Java program on Linux able to
>>>>> read filenames from stdin and later open those filenames?
>>>>>
>>>>> Thanks!
>>>>>
>>> --
>>> Naoto Sato
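Martin's round-trip claim - that ISO-8859-1 decoding preserves every byte value while UTF-8 decoding does not - can be checked with a small standalone program. This is a hypothetical demo written to illustrate the point, not code from the thread:

```java
import java.util.Arrays;

public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        // A filename whose bytes are not valid UTF-8: "test" plus the
        // Latin-1 byte 0xE9 ("é" in ISO-8859-1).
        byte[] rawName = { 't', 'e', 's', 't', (byte) 0xE9 };

        // ISO-8859-1 maps each byte b to the char (b & 0xFF), so decoding
        // and re-encoding gives back exactly the original bytes.
        String decoded = new String(rawName, "ISO-8859-1");
        byte[] roundTripped = decoded.getBytes("ISO-8859-1");
        System.out.println(Arrays.equals(rawName, roundTripped));   // true

        // UTF-8 decoding replaces the malformed byte 0xE9 with U+FFFD, so
        // the same round trip through UTF-8 is lossy - the name changes,
        // which is exactly the file-not-found failure mode in the thread.
        String utf8Decoded = new String(rawName, "UTF-8");
        byte[] utf8RoundTripped = utf8Decoded.getBytes("UTF-8");
        System.out.println(Arrays.equals(rawName, utf8RoundTripped)); // false
    }
}
```

The first println demonstrates why -Dfile.encoding=ISO-8859-1 makes the open path byte-reversible; the second demonstrates why Naoto's UTF-8 suggestion can silently alter names whose bytes are not valid UTF-8.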
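Combining the two suggestions, the reading loop from the thread can be made encoding-explicit, so stdin is decoded with the same reversible charset that -Dfile.encoding=ISO-8859-1 selects for the file-open path. A sketch only - the class name is illustrative and this is not the equivs.jar code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadFileNames {
    public static void main(String[] args) throws IOException {
        // Decode stdin explicitly as ISO-8859-1 rather than relying on the
        // platform default charset; every input byte becomes exactly one
        // char, so no byte sequence can be mangled on the way in.
        BufferedReader stdin = new BufferedReader(
                new InputStreamReader(System.in, "ISO-8859-1"));

        String line;
        while ((line = stdin.readLine()) != null) {
            // When a name is later passed to new FileInputStream(line),
            // the JDK re-encodes it using file.encoding - so launch with
            // -Dfile.encoding=ISO-8859-1 as Martin suggests, and the
            // original bytes come back out unchanged.
            System.out.println(line);
        }
    }
}
```

Run the same way as in the thread, e.g. `find test-files -type f -print | java -Dfile.encoding=ISO-8859-1 ReadFileNames`; the explicit charset on the reader removes the dependence on $LC_ALL/$LANG for the stdin side.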