It still errors with a file not found:
+ LC_ALL=en_US.ISO8859-1
+ export LC_ALL
+ find /home/dstromberg/Sound/Music -type f -print
+ java -Xmx512M -jar equivs.jar equivs.main
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various
Artists/Dreamland/11 - Canci??n Para Dormir a un Ni??o (Argentina).flac
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:106)
at java.io.FileInputStream.<init>(FileInputStream.java:66)
at Sortable_file.get_prefix(Sortable_file.java:56)
at Sortable_file.compareTo(Sortable_file.java:159)
at Sortable_file.compareTo(Sortable_file.java:1)
at java.util.Arrays.mergeSort(Arrays.java:1167)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1156)
at java.util.Arrays.mergeSort(Arrays.java:1156)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.sort(Arrays.java:1079)
at equivs.main(equivs.java:40)
make: *** [wrapped] Error 1
...and the foo.java program gives:
$ LC_ALL=en_US.ISO8859-1; export LC_ALL; java foo
sun.jnu.encoding=ANSI_X3.4-1968
file.encoding=ANSI_X3.4-1968
default locale=en_US
Thanks folks.
Xueming Shen wrote:
Martin, don't trap people into using -Dfile.encoding; always treat it
as a read-only property :-)
I believe initializeEncoding(env) gets invoked before -Dxyz=abc
overwrites the default one; besides, the "jnu encoding" was introduced
in 6.0, so we no longer look at file.encoding since then. I believe
you "ARE" the reviewer :-)
Dan, I kind of feel you should switch the locale to an ISO8859-1
locale in your wrapper; for example
LC_ALL=en_US.ISO8859-1; export LC_ALL; java -Xmx512M -jar equivs.jar equivs.main
should work. If it does not, can you try running
LC_ALL=en_US.ISO8859-1; export LC_ALL; java Foo
with Foo.java:
System.out.println("sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"));
System.out.println("file.encoding=" + System.getProperty("file.encoding"));
System.out.println("default locale=" + java.util.Locale.getDefault());
Let us know the result?
sherman
Martin Buchholz wrote:
On Wed, Sep 10, 2008 at 17:50, Dan Stromberg <[EMAIL PROTECTED]> wrote:
Would you believe that I'm getting file not found errors even with
ISO-8859-1?
The software world is full of surprises.
Try
export LANG=C LC_ALL=C LC_CTYPE=C
java ... -Dfile.encoding=ISO-8859-1 ...
You could also be explicit about the
encoding used when doing any kind of char<->byte
conversion, e.g. reading from stdin or writing to stdout.
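As a minimal sketch of that suggestion (the class name is made up, and the choice of ISO-8859-1 is an assumption carried over from the rest of the thread), wiring explicit charsets around stdin and stdout might look like:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;

public class ExplicitEncoding {
    public static void main(String[] args) throws Exception {
        // Decode stdin as ISO-8859-1 regardless of the platform default,
        // so every input byte maps to exactly one char (0x00-0xFF).
        BufferedReader stdin = new BufferedReader(
                new InputStreamReader(System.in, "ISO-8859-1"));
        // Encode stdout with the same explicit charset, so the bytes
        // written are exactly the bytes that were read.
        PrintStream stdout = new PrintStream(System.out, true, "ISO-8859-1");
        String line;
        while ((line = stdin.readLine()) != null) {
            stdout.println(line);
        }
    }
}
```

With both ends pinned to the same single-byte charset, the program never depends on file.encoding for its stdin/stdout traffic.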
Oh, and this is only traditional Unix systems like
Linux and Solaris. Windows and MacOSX
(at least should) act very differently in this area.
Martin
(Naoto: My program doesn't know what encoding to expect - I'm afraid I
probably have different applications writing filenames in different
encodings on my Ubuntu system. I'd been thinking I wanted to treat
filenames as just a sequence of bytes, and let the terminal emulator
interpret the encoding (hopefully) correctly on output.)
This gives two file not found tracebacks:
export LC_ALL='ISO-8859-1'
export LC_CTYPE="$LC_ALL"
export LANG="$LC_ALL"
find 'test-files' -type f -print | java -Xmx512M \
    -Dfile.encoding=ISO-8859-1 \
    -jar equivs.jar equivs.main
find ~/Sound/Music -type f -print | java -Xmx512M \
    -Dfile.encoding=ISO-8859-1 \
    -jar equivs.jar equivs.main
I'm reading the filenames like:
try {
    while ((line = stdin.readLine()) != null)
    {
        // System.out.println(line);
        // System.out.flush();
        lst.add(new Sortable_file(line));
    }
}
catch (java.io.IOException e)
{
    System.err.println("**** exception " + e);
    e.printStackTrace();
}
Where Sortable_file's constructor just looks like:
public Sortable_file(String filename)
{
    this.filename = filename;
    /*
     * Java doesn't have a stat function without doing some fancy stuff,
     * so we skip this optimization. It really only helps with hard links
     * anyway.
     * this.device = -1;
     * this.inode = -1;
     */
    File file = new File(this.filename);
    this.size = file.length();
    // It bothers me a little that we can't close this, but perhaps it's
    // unnecessary. That'll be determined in large tests.
    // file.close();
    this.have_prefix = false;
    this.have_hash = false;
}
...and the part that actually blows up looks like:
private void get_prefix()
{
    byte[] buffer = new byte[128];
    try
    {
        // The next line is the one that gives "file not found"
        FileInputStream file = new FileInputStream(this.filename);
        file.read(buffer);
        // System.out.println("this.prefix.length " + this.prefix.length);
        file.close();
    }
    catch (IOException ioe)
    {
        // System.out.println("IO error: " + ioe);
        ioe.printStackTrace();
        System.exit(1);
    }
    this.prefix = new String(buffer);
    this.have_prefix = true;
}
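For comparison, a variant of get_prefix that pins the charset and honors the byte count returned by read() might look like this (a sketch only; the class and method names are made up, and ISO-8859-1 is an assumption carried over from the rest of the thread - the original `new String(buffer)` uses the platform default charset and ignores how many bytes were actually read):

```java
import java.io.FileInputStream;
import java.io.IOException;

public class PrefixSketch {
    // Read up to 128 bytes from the start of a file and decode them
    // with an explicit charset, using only the bytes actually read.
    static String readPrefix(String filename) throws IOException {
        byte[] buffer = new byte[128];
        FileInputStream file = new FileInputStream(filename);
        try {
            int n = file.read(buffer); // may return fewer than 128, or -1
            return new String(buffer, 0, Math.max(n, 0), "ISO-8859-1");
        } finally {
            file.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readPrefix(args[0]));
    }
}
```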
Interestingly, it has already gotten the file's length without an error
by the time it goes to read data from the file and runs into trouble.
I don't -think- I'm doing anything screwy in there - could it be that
ISO-8859-1 isn't giving good round-trip conversions in practice?
Would this be an attribute of the java runtime in question, or could it
be a matter of the locale files on my Ubuntu system being a little off?
It would seem the locale files would be a better explanation (or a bug
in my program I'm not seeing!), since I get the same errors with both
OpenJDK and gcj.
Martin Buchholz wrote:
ISO-8859-1 guarantees round-trip conversion between bytes and chars,
ensuring no loss of data and avoiding apparently impossible situations
where the JDK gives you a list of files in a directory, but you get
"File not found" when you try to open them.
If you want to show the file names to users, you can always take
your ISO-8859-1 decoded strings, turn them back into byte[],
and decode using UTF-8 later, if you so desire.
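Concretely, that re-decoding step might look like this (a sketch; the byte values here spell the UTF-8 form of part of one of the filenames from the stack trace above, and the variable names are illustrative):

```java
public class RedecodeForDisplay {
    public static void main(String[] args) throws Exception {
        // Bytes of "Canción" as UTF-8 on disk (0xC3 0xB3 is "ó"),
        // first decoded as ISO-8859-1, one char per byte - the safe,
        // reversible form used for opening the file:
        byte[] onDisk = {'C', 'a', 'n', 'c', 'i', (byte) 0xC3, (byte) 0xB3, 'n'};
        String rawName = new String(onDisk, "ISO-8859-1");
        // For display only: recover the original bytes, then
        // reinterpret them as UTF-8.
        String display = new String(rawName.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(display); // prints: Canción
    }
}
```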
(The basic OS interfaces in the JDK are not so flexible.
They are hard-coded to use the one charset specified by file.encoding)
Martin
On Wed, Sep 10, 2008 at 14:54, Naoto Sato <[EMAIL PROTECTED]> wrote:
Why ISO-8859-1? CJK filenames are guaranteed to fail in that case. I'd
rather choose UTF-8, as the default encoding on recent Unix/Linux
systems is UTF-8, so the filenames are likely in UTF-8.
Naoto
Martin Buchholz wrote:
Java made the decision to use String as an abstraction
for many OS-specific objects, like filenames (or environment
variables).
Most of the time this works fine, but occasionally you can notice
that the underlying OS (in the case of Unix) actually uses
arbitrary byte arrays as filenames.
It would have been much more confusing to provide an interface
to filenames that is sometimes a sequence of char, sometimes a
sequence of byte.
So this is unlikely to change.
But if all you want is reliable reversible conversion,
using java -Dfile.encoding=ISO-8859-1
should do the trick.
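The reversibility being relied on here can be checked directly: ISO-8859-1 maps each of the 256 byte values to the Unicode code point with the same numeric value, and back (a small self-contained check; the class name is illustrative):

```java
public class RoundTripCheck {
    public static void main(String[] args) throws Exception {
        // Every possible byte value 0x00..0xFF...
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) all[i] = (byte) i;
        // ...decodes to chars U+0000..U+00FF and encodes back unchanged.
        String s = new String(all, "ISO-8859-1");
        byte[] back = s.getBytes("ISO-8859-1");
        System.out.println(java.util.Arrays.equals(all, back)); // prints: true
    }
}
```

No other common charset has this property for arbitrary byte input: UTF-8, for instance, rejects or replaces malformed sequences, which is exactly what loses filename bytes.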
Martin
On Tue, Sep 9, 2008 at 17:39, Dan Stromberg <[EMAIL PROTECTED]> wrote:
Sorry if this is the wrong list for this question. I tried asking it
on comp.lang.java, but didn't get very far there.
I've been wanting to expand my horizons a bit by taking one of my
programs and rewriting it into a number of other languages. It started
life in python, and I've recoded it into perl
(http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html).
Next on my list is java. After that I'll probably do Haskell and
Eiffel/Sather.
So the python and perl versions were pretty easy, but I'm finding that
the java version has a somewhat solution-resistant problem with
non-ASCII filenames.
The program just reads filenames from stdin (usually generated with
the *ix find command), and then compares those files, dividing them up
into equal groups.
The problem with the java version, which manifests both with OpenJDK
and gcj, is that the filenames being read from disk are 8 bit, and the
filenames opened by the OpenJDK JVM or gcj-compiled binary are 8 bit,
but as far as the java language is concerned, those filenames are made
up of 16 bit characters. That's fine, but going from 8 to 16 bit and
back to 8 bit seems to be non-information-preserving in this case,
which isn't so fine - I can clearly see the program, in an strace,
reading with one sequence of bytes, but then trying to open
another-though-related sequence of bytes. To be perfectly clear: it's
getting file not found errors.
By playing with $LC_ALL, $LC_CTYPE and $LANG, it appears I can get the
program to handle files with one encoding, but not another. I've tried
a bunch of values in these variables, including ISO-8859-1, C, POSIX,
UTF-8, and so on.
Is there such a thing as a filename encoding that will map 8 bit
filenames to 16 bit characters, but only using the low 8 bits of those
16, and then map back to 8 bit filenames using only those low 8 bits
again?
Is there some other way of making a Java program on Linux able to read
filenames from stdin and later open those filenames?
Thanks!
--
Naoto Sato