Re: Create FileInputStream in servlet from remote file with accentuated character name

André Warnier Tue, 22 Sep 2009 01:01:31 -0700

Christopher Schultz wrote:
...


What is the source of that file name? Is it hard-coded into your Java
code? If so, how? Did you just type "fichié.txt" into your .java file,
or did you use "\uxyz" syntax to specify the UNICODE character you intended?

If you are reading the filename from a remote client, then all the
request URI encodings and all that stuff are definitely relevant (in
spite of my previous statements to the contrary).

...

Honestly, I think the above should not be a problem.

...
Christopher,

what I am trying to say is that such matters are horrible, because
*everything* matters.

One cannot even be sure that the logfile message, as seen by the user
and as pasted in the email to the list, and as further seen by the
reader on this list, is really how the message is physically stored in
the logfile. That's because in-between, there can be umpteen layers of
decoding/encoding which can make matters really confusing.
(Even the encoding used by the process which writes the logfile may
matter, because "fichié.txt" may already have been re-encoded right there.)

Your note about making sure, in the source code of the program, that the
filename is really made out of the bytes which the OP thinks it is made
of, is a good example. If, to create this program source, one uses an
editor which is set to save its files in the iso-latin-1 charset, then
"fichié.txt" will be saved, in the program source, as a string of 10
bytes. Conversely, if one uses an editor set to save its files in
Unicode/UTF-8, then this same string will be saved as 11 bytes (the "é"
occupying 2 bytes).
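
The 10-versus-11 byte counts can be checked with a few lines of Java. This is only a sketch: the class and method names (FilenameBytes, byteLength) are made up for illustration, and the "é" is written as the escape \u00e9 so that the result does not depend on how the source file itself happens to be saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class FilenameBytes {

    // Number of bytes the given name occupies once encoded in the given charset.
    static int byteLength(String name, Charset charset) {
        return name.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; the escape keeps this source file pure ASCII.
        String name = "fichi\u00e9.txt";

        System.out.println(byteLength(name, StandardCharsets.ISO_8859_1)); // prints 10
        System.out.println(byteLength(name, StandardCharsets.UTF_8));      // prints 11
    }
}
```

The string has 10 characters either way; only the encoded form differs in length.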

Then comes the compiler..

I don't know how a Java compiler handles source code saved as an
iso-8859-1 encoded file, as opposed to a UTF-8 encoded file. How does it
tell the difference? Does it make assumptions based on the locale it is
running under?
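
For what it is worth, javac has an -encoding option to state the charset of the source files; when it is not given, the compiler falls back on the platform default charset, which on older JDKs is derived from the locale. The sketch below does not call the compiler. It only simulates, in memory, what a literal saved as UTF-8 turns into if it is read back as iso-8859-1. The class and method names (WrongSourceEncoding, misread) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of a source literal read with the wrong charset.
class WrongSourceEncoding {

    // Encode the literal as UTF-8 (what the editor saved), then decode those
    // bytes as ISO-8859-1 (what a reader assuming iso-latin-1 would see).
    static String misread(String literal) {
        byte[] savedAsUtf8 = literal.getBytes(StandardCharsets.UTF_8);
        return new String(savedAsUtf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String intended = "fichi\u00e9.txt";   // \u00e9 is "é"
        String seen = misread(intended);

        System.out.println(intended.length());     // prints 10
        System.out.println(seen.length());         // prints 11
        System.out.println(seen.equals(intended)); // prints false
    }
}
```

Writing the character as \u00e9 in the source, as Christopher suggested, sidesteps the question, since the file then contains only ASCII.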


About the creation and subsequent "finding" of a file :

Generally speaking, filesystems are "encoding agnostic", in the precise
sense that:
- if on a given platform and with a given programming language, you
  arrange for a string variable S to contain a precise series of bytes
  (for example, the UTF-8 encoding of the string "fichié.txt", 11 bytes long)
- if you then use that variable as the name of a file which you create
  on disk
- then no matter where this file directory ultimately resides, the name
  of the file in it will generally be these same exact 11 bytes.
- if you then, from the same platform and using the same programming
  language, use this same variable S as the name of a file which you try
  to open, it will work.

However, as soon as you deviate from the strict case above, what looks
to you like "fichié.txt" /may/ not be the same series of bytes anymore,
and that's where the problems start.

What the filename will "look" like is, however, another matter, depending
on what you use to display it and from where you do it.

In the case of Sylvie (and I am talking here about the final issue she
is trying to handle, not just about the test case):

- presumably, some (other) users and/or applications, running on some
(other) platform and using some (other) tools, are creating files inside
of a Windows host's directory.
One item of interest here would be to know how these files are created,
and if that process is consistent (meaning, are these files always
created by the same programs, running always on the same platform, using
the same encoding etc.). That is to make sure that when a file named
"fichié.txt" is created there by whatever program, it will always be
created the same way, with a name of either 10 or 11 bytes (it does not
matter which, just that it be consistent).

- then, some program created by Sylvie has to access that directory,
and pick up files from there. So this program may have to "know" how a
filename "fichié.txt" will be encoded in that directory (either as 10 or
11 bytes). It also does not matter which, as long as Sylvie's program
has a way to consistently "spell" this name correctly.

The problem is generally unsolvable, if the original entry in the
directory can be created in several ways, because there are multiple
agents capable of creating it, and these agents use inconsistent encodings.

The issue can be simpler, if Sylvie's program just opens the directory,
reads the filenames that it finds there (whatever their encoding is),
into some variable, and then just uses this variable as the filename to
open the file and that's it.
But if, in Sylvie's program, the filename itself has to be compared to
some pre-defined other string stored in the program, and some action
taken or not depending on whether it is considered equal, then there may
be a problem.
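
The simpler case, where the program never spells the name itself, could look roughly like the sketch below. The class and method names (OpenWhatYouList, openable) are made up for illustration. Note one assumption that should be verified on the actual platform: Java hands the names back as decoded Strings, not raw bytes, so a name which the JVM cannot decode under its current locale may still fail to open again.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: open files using exactly the entries the directory
// listing returned, without ever typing a filename in the program.
class OpenWhatYouList {

    // Returns the names of the regular files in dir which could be opened.
    static List<String> openable(File dir) {
        List<String> opened = new ArrayList<>();
        File[] entries = dir.listFiles();   // null if dir is not a readable directory
        if (entries == null) {
            return opened;
        }
        for (File entry : entries) {
            if (!entry.isFile()) {
                continue;                   // skip sub-directories and special files
            }
            // Reuse the File object as returned by the listing.
            try (InputStream in = new FileInputStream(entry)) {
                in.read();                  // read one byte to prove the open worked
                opened.add(entry.getName());
            } catch (IOException e) {
                // Could not be opened under this name; leave it out of the result.
            }
        }
        return opened;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : openable(dir)) {
            System.out.println(name);
        }
    }
}
```

Any file which is listed by the directory but missing from the output is a candidate for exactly the kind of encoding mismatch discussed here.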

Yet another aspect to consider is whether Sylvie is really testing the
right thing.
For instance, when Sylvie runs her Java test program, she does this from
inside a Linux session, which is set for a specific "locale".
However, the Tomcat server may well be started under a different locale
setting, and this may have an impact on how each one of them looks at
the filename "fichié.txt".
(And also, as you mention, it depends how this string "fichié.txt" gets
/into/ the program.)
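
One way to compare the two environments is to print the same settings from the command-line test program and from inside the webapp (for example, into the Tomcat log). A minimal sketch follows; the class and method names (ShowEncodingSettings, describe) are made up for illustration, and "sun.jnu.encoding" is a JVM-internal property which is assumed to be present on the JVM in use (it prints as null if it is not).

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical diagnostic: report the encoding-related settings of this JVM.
class ShowEncodingSettings {

    static String describe() {
        return "default charset = " + Charset.defaultCharset()
             + ", file.encoding = " + System.getProperty("file.encoding")
             + ", sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding")
             + ", locale = " + Locale.getDefault();
    }

    public static void main(String[] args) {
        // Run once from the shell, then log describe() from the servlet,
        // and compare the two lines.
        System.out.println(describe());
    }
}
```

If the two lines differ, the test program and Tomcat are not looking at the filename under the same conditions.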

Then of course, after the above trivial matter of the filename is
resolved, one may have to tackle the matter of how the file contents are
encoded.

:-)


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
