Christopher Schultz wrote:
...
What is the source of that file name? Is it hard-coded into your Java
code? If so, how? Did you just type "fichié.txt" into your .java file,
or did you use "\uxyz" syntax to specify the UNICODE character you intended?
If you are reading the filename from a remote client, then all the
request URI encodings and all that stuff are definitely relevant (in
spite of my previous statements to the contrary).
...
Honestly, I think the above should not be a problem.
...
Christopher,
what I am trying to say is that such matters are horrible, because
*everything* matters.
One cannot even be sure that the logfile message, as seen by the user
and as pasted in the email to the list, and as further seen by the
reader on this list, is really how the message is physically stored in
the logfile. That's because in-between, there can be umpteen layers of
decoding/encoding which can make matters really confusing.
(Even the encoding used by the process which writes the logfile may
matter, because "fichié.txt" may already have been re-encoded right there.)
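One way around those display layers is to log the name in a form that cannot be re-encoded on the way to the reader: pure ASCII escapes, or a hex dump of the bytes. The following is only a sketch; the class and the helper names escape() and hex() are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class NameDump {
    // Keeps printable ASCII as-is and renders every other char as a
    // backslash-u escape, so the output survives any charset conversion.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x20 && c < 0x7f) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    // Renders raw bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String name = "fichi\u00e9.txt";
        System.out.println(escape(name));
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the escaped form in the logfile shows something other than a single e-acute where the "é" should be, the name was already altered before it reached the logging call.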
Your note about making sure, in the source code of the program, that the
filename is really made out of the bytes which the OP thinks it is made
of, is a good example. If, to create this program source, one uses an
editor which is set to save its files in the iso-latin-1 charset, then
"fichié.txt" will be saved, in the program source, as a string of 10
bytes. Conversely, if one uses an editor set to save its files in
Unicode/UTF-8, then this same string will be saved as 11 bytes (the "é"
occupying 2 bytes).
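The 10-versus-11 count is easy to check from Java itself. A minimal sketch (the class name is invented; the literal uses an escape for the "é" so that the encoding of this source file does not matter):

```java
import java.nio.charset.StandardCharsets;

class NameBytes {
    // The escape stands for e-acute and keeps this source pure ASCII.
    static final String NAME = "fichi\u00e9.txt";

    static int latin1Length() {
        return NAME.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    static int utf8Length() {
        return NAME.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + latin1Length() + " bytes"); // 10
        System.out.println("UTF-8:      " + utf8Length() + " bytes");   // 11
    }
}
```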
Then comes the compiler.
I don't know how a Java compiler handles source code saved as an
iso-8859-1 encoded file versus a UTF-8 encoded file. How does it
tell the difference? Does it make assumptions based on the locale it is
running under?
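For what it is worth, javac accepts an -encoding option, and as far as I know it falls back to a default when the option is absent (the platform default on older JDKs, UTF-8 from JDK 18 on). What happens when the assumed charset does not match the one the editor used can be simulated without involving the compiler at all. This is only a sketch of the effect, with invented names, not a description of javac internals:

```java
import java.nio.charset.StandardCharsets;

class SourceDecoding {
    // The bytes an editor set to UTF-8 would write for the literal.
    static byte[] savedAsUtf8() {
        return "fichi\u00e9.txt".getBytes(StandardCharsets.UTF_8);
    }

    // What a reader assuming ISO-8859-1 makes of those same bytes.
    static String readAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String misread = readAsLatin1(savedAsUtf8());
        // The two bytes of the e-acute become two separate characters.
        System.out.println(misread.length());                   // 11
        System.out.println(misread.equals("fichi\u00e9.txt"));  // false
    }
}
```

Writing the literal with an escape, as in the example, sidesteps the question, because the source file then contains only ASCII.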
About the creation and subsequent "finding" of a file:
Generally speaking, filesystems are "encoding agnostic", in the precise
sense that:
- if on a given platform and with a given programming language, you
arrange for a string variable S to contain a precise series of bytes
(for example, the UTF-8 encoding of the string "fichié.txt", 11 bytes long)
- if you then use that variable as the name of a file which you create
on disk
- then no matter where this file directory ultimately resides, the name
of the file in it will generally be these same exact 11 bytes.
- if you then, from the same platform and using the same programming
language, use this same variable S as the name of a file which you try
to open, it will work.
However, as soon as you deviate from the strict case above, what looks
to you like "fichié.txt" /may/ not be the same series of bytes anymore,
and that's where the problems start.
What the filename will "look" like is, however, another matter, depending
on what you use to display it and from where you do it.
In the case of Sylvie (and I am talking here about the final issue she
is trying to handle, not just about the test case)
- presumably, some (other) users and/or applications, running on some
(other) platform and using some (other) tools, are creating files inside
of a Windows host's directory.
One item of interest here would be to know how these files are created,
and if that process is consistent (meaning, are these files always
created by the same programs, running always on the same platform, using
the same encoding, etc.). That is to make sure that when a file named
"fichié.txt" is created there by whatever, it will always be created the
same way, with a name of either 10 or 11 bytes (it does not matter
which, just that it be consistent).
- then, some program created by Sylvie has to access that directory,
and pick up files from there. So this program may have to "know" how a
filename "fichié.txt" will be encoded in that directory (either as 10 or
11 bytes). It also does not matter which, as long as Sylvie's program
has a way to consistently "spell" this name correctly.
The problem is generally unsolvable, if the original entry in the
directory can be created in several ways, because there are multiple
agents capable of creating it, and these agents use inconsistent encodings.
The issue can be simpler, if Sylvie's program just opens the directory,
reads the filenames that it finds there (whatever their encoding is),
into some variable, and then just uses this variable as the filename to
open the file and that's it.
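A minimal sketch of that simpler case, using java.nio.file (the class and method names are invented). The program never spells a filename itself; it only hands back what the directory listing gave it. This assumes the JVM can represent the names it reads, which on Linux depends on the locale the JVM runs under.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class PickUpFiles {
    // Reads every regular file in dir, opening each one through the
    // Path object returned by the listing rather than a typed-in name.
    static List<byte[]> readAll(Path dir) throws IOException {
        List<byte[]> contents = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    contents.add(Files.readAllBytes(entry));
                }
            }
        }
        return contents;
    }
}
```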
But if, in Sylvie's program, the filename itself has to be compared to
some pre-defined string stored in the program, and some action taken
depending on whether the two are considered equal, then there may be a
problem.
Yet another aspect to consider, is whether Sylvie is really testing the
right thing.
For instance, when Sylvie runs her Java test program, she does this from
inside a Linux session, which is set for a specific "locale".
However, the Tomcat server may well be started under a different locale
setting, and this may have an impact as to how each one of them looks at
the filename "fichié.txt".
(And also, as you mention, it depends how this string "fichié.txt" gets
/into/ the program.)
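To compare the two environments, one could print the JVM's own view of its charset and locale, once from the command-line test and once from inside the webapp. A sketch, with an invented class name; "sun.jnu.encoding" is an implementation-specific property which I assume is present on the usual JVMs but which may well be missing elsewhere, hence the String.valueOf():

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class JvmEncodingReport {
    // Collects the settings most likely to differ between a shell
    // session and a server started from an init script.
    static Map<String, String> report() {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("default charset", Charset.defaultCharset().name());
        r.put("default locale", Locale.getDefault().toString());
        r.put("file.encoding",
              String.valueOf(System.getProperty("file.encoding")));
        // Implementation-specific; prints "null" where it is not set.
        r.put("sun.jnu.encoding",
              String.valueOf(System.getProperty("sun.jnu.encoding")));
        return r;
    }

    public static void main(String[] args) {
        report().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

If the two outputs differ, the test program and Tomcat are not testing the same thing.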
Then of course, after the above trivial matter of the filename is
resolved, one may have to tackle the matter of how the file contents are
encoded.
:-)