Re: Reading Linux filenames in a way that will map back the same on open?

2008-09-13 Thread Dan Stromberg


Sadly, I'm still getting ghost files with C and ISO-8859-1:

./wrapper
+ case 3 in
+ export LC_ALL=C
+ LC_ALL=C
+ export LC_CTYPE=C
+ LC_CTYPE=C
+ export LANG=C
+ LANG=C
+ find /home/dstromberg/Sound/Music -type f -print
+ java -Xmx512M -Dfile.encoding=ISO-8859-1 -jar equivs.jar equivs.main
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various 
Artists/Dreamland/11 - Canción Para Dormir a un Niño (Argentina).flac 
(No such file or directory)

at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:106)
at java.io.FileInputStream.<init>(FileInputStream.java:66)
at Sortable_file.get_prefix(Sortable_file.java:56)
at Sortable_file.compareTo(Sortable_file.java:159)
at Sortable_file.compareTo(Sortable_file.java:1)
at java.util.Arrays.mergeSort(Arrays.java:1167)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1156)
at java.util.Arrays.mergeSort(Arrays.java:1156)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.sort(Arrays.java:1079)
at equivs.main(equivs.java:40)
make: *** [wrapped] Error 1


Martin Buchholz wrote:

On Wed, Sep 10, 2008 at 17:50, Dan Stromberg <[EMAIL PROTECTED]> wrote:


Would you believe that I'm getting file not found errors even with
ISO-8859-1?


The software world is full of surprises.

Try
export LANG=C LC_ALL=C LC_CTYPE=C
java ... -Dfile.encoding=ISO-8859-1 ...

You could also be explicit about the
encoding used when doing any kind of char<->byte
conversion, e.g. reading from stdin or writing to stdout.

Oh, and this applies only to traditional Unix systems like
Linux and Solaris.  Windows and MacOSX
should (at least in principle) act very differently in this area.

Martin
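
As a minimal sketch of the suggestion above to be explicit about the char<->byte conversion, the reader on stdin can be handed a charset instead of relying on the platform default. The class and method names are hypothetical, and this only covers the stdin side: how the JVM turns the resulting String back into bytes when it opens the file is still governed by the locale.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReadNames {
    // Decode each input line as ISO-8859-1 so that every byte maps to
    // exactly one char and nothing is replaced or dropped.
    static List<String> readNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = reader.readLine()) != null) {
            names.add(line);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Typical use: find some-dir -type f -print | java ReadNames
        for (String name : readNames(System.in)) {
            System.out.println(name.length() + " chars read");
        }
    }
}
```

Because the names are split on line terminators, a filename that itself contains a newline or carriage return would still be broken apart, just as it would be in the find -print output.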


(Naoto: My program doesn't know what encoding to expect - I'm afraid I
probably have different applications writing filenames in different
encodings on my Ubuntu system.  I'd been thinking I wanted to treat
filenames as just a sequence of bytes, and let the terminal emulator
interpret the encoding (hopefully) correctly on output).



This gives two file not found tracebacks:

export LC_ALL='ISO-8859-1'
export LC_CTYPE="$LC_ALL"
export LANG="$LC_ALL"

find 'test-files' -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 \
    -jar equivs.jar equivs.main

find ~/Sound/Music -type f -print | java -Xmx512M -Dfile.encoding=ISO-8859-1 \
    -jar equivs.jar equivs.main



I'm reading the filenames like this:

try {
    while ((line = stdin.readLine()) != null) {
        // System.out.println(line);
        // System.out.flush();
        lst.add(new Sortable_file(line));
    }
} catch (java.io.IOException e) {
    System.err.println(" exception " + e);
    e.printStackTrace();
}



Where Sortable_file's constructor just looks like:

public Sortable_file(String filename) {
    this.filename = filename;
    /*
     * Java doesn't have a stat function without doing some fancy stuff, so we
     * skip this optimization.  It really only helps with hard links anyway.
     * this.device = -1
     * this.inode = -1
     */
    File file = new File(this.filename);
    this.size = file.length();
    // It bothers a little that we can't close this, but perhaps it's
    // unnecessary.  That'll be determined in large tests.
    // file.close();
    this.have_prefix = false;
    this.have_hash = false;
}



..and the part that actually blows up looks like:

private void get_prefix() {
    byte[] buffer = new byte[128];
    try {
        // The next line is the one that gives file not found
        FileInputStream file = new FileInputStream(this.filename);
        file.read(buffer);
        // System.out.println("this.prefix.length " + this.prefix.length);
        file.close();
    } catch (IOException ioe) {
        // System.out.println( "IO error: " + ioe );
        ioe.printStackTrace();
        System.exit(1);
    }
    this.prefix = new String(buffer);
    this.have_prefix = true;
}



Interestingly, it's already tried to get the file's length without an error
when it goes to read data from the file and has trouble.
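
One likely reason the length call gave no error: File.length() is documented to return 0L when the file does not exist rather than throwing, so a name that was mangled on the way in would only fail at open time. A small sketch, with a hypothetical class name and a hypothetical path that is assumed not to exist:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class LengthVsOpen {
    // Returns true if the file can be opened for reading, false otherwise.
    static boolean opens(File f) {
        try (FileInputStream in = new FileInputStream(f)) {
            return true;
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical path, assumed not to exist on the machine running this.
        File missing = new File("no-such-dir-for-demo/no-such-file.flac");

        // length() reports 0 for a missing file instead of raising an error.
        System.out.println("exists=" + missing.exists()
                + " length=" + missing.length());

        // The failure only becomes visible when the file is opened.
        System.out.println("opens=" + opens(missing));
    }
}
```

Checking file.exists() in the constructor would surface a badly decoded name much earlier than the sort comparator does.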

I don't -think- I'm doing anything screwy in there - could it be that
ISO-8859-1 isn't giving good round-trip conversions in practice?  Would this
be an attribute of the java runtime in question, or could it be a matter of
the locale files on my Ubuntu system being a little off?  It would seem the
locale files would be a better explanation (or a bug in my pr

Re: Reading Linux filenames in a way that will map back the same on open?

2008-09-13 Thread Xueming Shen


Martin, don't trap people into using -Dfile.encoding; always treat it as
a read-only property :-)

I believe initializeEncoding(env) gets invoked before -Dxyz=abc
overwrites the default one. Besides, the "jnu encoding" was introduced in
6.0, so we have not looked at file.encoding since then, I believe.

You "ARE" the reviewer :-)

Dan, my feeling is that switching to an ISO8859-1 locale in your
wrapper, for example

LC_ALL=en_US.ISO8859-1; export LC_ALL; java -Xmx512M -jar equivs.jar equivs.main

should work. If it does not, can you try to run

LC_ALL=en_US.ISO8859-1; export LC_ALL; java Foo

with Foo.java containing

public class Foo {
    public static void main(String[] args) {
        System.out.println("sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"));
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("default locale=" + java.util.Locale.getDefault());
    }
}

and let us know the result?

sherman


Dan Stromberg wrote:

I don't -think- I'm doing anything screwy in there - could it be that
ISO-8859-1 isn't giving good round-trip conversions in practice?  Would this
be an attribute of the java runtime in question, or could it be a matter of
the locale files on my Ubuntu system being a little off?  It would seem the
locale files would be a better explanation (or a bug in my program I'm not
seeing!), since I get the same errors with both OpenJDK and gcj.

Martin Buchholz wrote:


ISO-8859-1 guarantees round-trip conversion between bytes and chars,
so no data is lost and you avoid apparently impossible situations
where the JDK gives you a list of files in a directory, but you get
File not found when you try to open them.

If you want to show the file names to users, you can always take
your ISO-8859-1 decoded strings, turn them back into byte[],
and decode using UTF-8 later, if you so desire.
(The basic OS interfaces in the JDK are not so flexible.
They are har
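
The round-trip property and the decode-for-display step described above can be sketched as follows. The class and method names are hypothetical, and the example assumes the underlying file name bytes happen to be UTF-8; with names of mixed or unknown encodings the display string may contain replacement characters, while the ISO-8859-1 string still preserves the bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    // Recover the original bytes from a string that was decoded as ISO-8859-1.
    static byte[] toBytes(String latin1Decoded) {
        return latin1Decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Reinterpret those bytes as UTF-8, for display purposes only.
    static String forDisplay(String latin1Decoded) {
        return new String(toBytes(latin1Decoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Every one of the 256 byte values survives decode followed by encode.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        System.out.println("round trip ok: " + Arrays.equals(all, toBytes(decoded)));

        // UTF-8 bytes of a name, carried through the program as ISO-8859-1 chars.
        byte[] utf8Name = "Canci\u00f3n".getBytes(StandardCharsets.UTF_8);
        String opaque = new String(utf8Name, StandardCharsets.ISO_8859_1);
        System.out.println("display ok: " + forDisplay(opaque).equals("Canci\u00f3n"));
    }
}
```

The string held internally (opaque above) is the one to pass to File and FileInputStream; the display form is only for showing to users.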

Re: Reading Linux filenames in a way that will map back the same on open?

2008-09-13 Thread Dan Stromberg


It still errors with a file not found:

+ LC_ALL=en_US.ISO8859-1
+ export LC_ALL
+ find /home/dstromberg/Sound/Music -type f -print
+ java -Xmx512M -jar equivs.jar equivs.main
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/Various 
Artists/Dreamland/11 - Canci??n Para Dormir a un Ni??o (Argentina).flac 
(No such file or directory)

   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:106)
   at java.io.FileInputStream.<init>(FileInputStream.java:66)
   at Sortable_file.get_prefix(Sortable_file.java:56)
   at Sortable_file.compareTo(Sortable_file.java:159)
   at Sortable_file.compareTo(Sortable_file.java:1)
   at java.util.Arrays.mergeSort(Arrays.java:1167)
   at java.util.Arrays.mergeSort(Arrays.java:1155)
   at java.util.Arrays.mergeSort(Arrays.java:1155)
   at java.util.Arrays.mergeSort(Arrays.java:1156)
   at java.util.Arrays.mergeSort(Arrays.java:1156)
   at java.util.Arrays.mergeSort(Arrays.java:1155)
   at java.util.Arrays.mergeSort(Arrays.java:1155)
   at java.util.Arrays.mergeSort(Arrays.java:1155)
   at java.util.Arrays.mergeSort(Arrays.java:1155)
   at java.util.Arrays.mergeSort(Arrays.java:1155)
   at java.util.Arrays.sort(Arrays.java:1079)
   at equivs.main(equivs.java:40)
make: *** [wrapped] Error 1

...and the foo.java program gives:

$ LC_ALL=en_US.ISO8859-1; export LC_ALL; java foo
sun.jnu.encoding=ANSI_X3.4-1968
file.encoding=ANSI_X3.4-1968
default locale=en_US

Thanks folks.


Re: Reading Linux filenames in a way that will map back the same on open?

2008-09-13 Thread Xueming Shen
Obviously your locale setting is not being "exported"... what shell are
you using?

You can try setting your locale to en_US.ISO8859-1 explicitly at the
command line first, type "locale" to confirm that it is set correctly to
en_US.ISO8859-1, and then run the "find + java" pipeline to see whether the
FNF error disappears. If not, run java Foo again and tell us the result :-)

One possibility is that you don't have an ISO8859-1 locale installed at all.

Sherman


Re: Reading Linux filenames in a way that will map back the same on open?

2008-09-13 Thread Dan Stromberg

Xueming Shen wrote:
Obviously your locale setting is not being "exported"...what "shell" are 
you using?


It's bash.  I'm pretty sure it's exported, because env sees it, and env 
isn't a shell builtin in bash (at least not yet :).


You can try to set your locale to en_US.ISO8859-1 explicitly at command 
line first,
type in "locale" to confirm that your locale is being set correctly to 
en_US.ISO8859-1,


Good clue:

$ export LC_ALL=en_US.ISO8859-1
dstromberg-desktop-dstromberg:~/src/equivs-j i486-pc-linux-gnu 11433 - 
above cmd done 2008 Sat Sep 13 10:13 PM


$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LC_CTYPE="en_US.ISO8859-1"
LC_NUMERIC="en_US.ISO8859-1"
LC_TIME="en_US.ISO8859-1"
LC_COLLATE="en_US.ISO8859-1"
LC_MONETARY="en_US.ISO8859-1"
LC_MESSAGES="en_US.ISO8859-1"
LC_PAPER="en_US.ISO8859-1"
LC_NAME="en_US.ISO8859-1"
LC_ADDRESS="en_US.ISO8859-1"
LC_TELEPHONE="en_US.ISO8859-1"
LC_MEASUREMENT="en_US.ISO8859-1"
LC_IDENTIFICATION="en_US.ISO8859-1"
LC_ALL=en_US.ISO8859-1

It turned out I didn't have en_US.ISO-8859-1 configured on my system. 
So I used this URL to get it set up: 
http://ubuntuforums.org/showthread.php?t=423039  I didn't make it my 
default locale; I just made it a supported locale.


And now my program appears to work great, even with non-English 
filenames - thanks folks!
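
For anyone hitting the same problem, a startup check along these lines might catch a missing locale before any file is opened. This is only a sketch: sun.jnu.encoding is the internal property printed by the Foo test earlier in the thread, not a documented API, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // True when the given charset name (or alias) resolves to ISO-8859-1.
    static boolean isLatin1(String charsetName) {
        if (charsetName == null) {
            return false;
        }
        try {
            return Charset.isSupported(charsetName)
                    && Charset.forName(charsetName).equals(StandardCharsets.ISO_8859_1);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this internal property reflects the encoding the JVM
        // uses for file names, as reported earlier in the thread.
        String jnu = System.getProperty("sun.jnu.encoding");
        if (!isLatin1(jnu)) {
            System.err.println("warning: sun.jnu.encoding=" + jnu
                    + "; file names with non-ASCII bytes may not round-trip."
                    + " Check that the requested locale is installed.");
        }
    }
}
```

With the locale missing, as in the output above, the property came back as ANSI_X3.4-1968 (an alias for US-ASCII), which this check would flag.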