Konstantin Kolinko wrote:
2008/9/12 André Warnier <[EMAIL PROTECTED]>
Konstantin Kolinko wrote:
2008/9/12 André Warnier <[EMAIL PROTECTED]>:
Caldarale, Charles R wrote:
I'm not sure these days what the "normal web character set" really is.
If you're referring to ASCII (aka Basic Latin), then no, the Pound
Sterling symbol is not present. However, for ISO-8859-1 and several
(though not all) of the other ISO-8859-x variants, it is present, using
the 163 (0xA3) value you noted (same as the Unicode code point). It's
also in UTF-8 of course, but requires two bytes (0xC2 0xA3) to represent
the code point.
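The byte values mentioned above can be checked with a few lines of Java. This is only a sketch: the class name PoundSignBytes and the helper hex are made up for illustration, and StandardCharsets needs Java 7 or later (on older JVMs, Charset.forName("ISO-8859-1") would be the equivalent).

```java
import java.nio.charset.StandardCharsets;

public class PoundSignBytes {
    // Formats bytes as space-separated uppercase hex, e.g. "C2 A3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00A3 POUND SIGN, written as an escape so that the encoding
        // of this source file does not matter.
        String pound = "\u00A3";
        System.out.println("ISO-8859-1: " + hex(pound.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:      " + hex(pound.getBytes(StandardCharsets.UTF_8)));
        // ASCII has no pound sign, so Java substitutes "?" (0x3F).
        System.out.println("US-ASCII:   " + hex(pound.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The last line is worth noting: when a character cannot be represented in the target charset, String.getBytes silently substitutes the charset's replacement byte, which for ASCII is "?".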
I love these discussions about character sets. They seem to confuse so
many people; even I, who have been involved in them for 30 years...
Anyway, I have a related question, which I don't think constitutes a
hijack of this thread, because the underlying cause is probably similar.
Here it goes :
Tomcat (v4.1, v5.0, v5.5; I have not tried 6.x yet)
The above Tomcats are running under the same Linux or Solaris,
essentially set up the same way. The JVM may vary, but I don't think
that is the problem, because of the consistency of the problem as
explained below.
I am running a webapp from an external supplier, always the same binary
version. I don't have the code, can't see what's in it.
The pages served by that webapp are the same html pages, all of them
having a declaration
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
The pages also *are* properly encoded as iso-8859-1 (100% positive, I
know the difference).
The browser receiving the pages is always the same one, same settings.
Now,
case a)
In the Tomcat startup files, I do nothing, meaning I just take Tomcat
out-of-the-box and run the webapp.
Result : in any such html page that contains characters with an ISO-8859
codepoint above \xA0 (meaning the displayable characters of the "high"
part of the table, where one finds things like "uppercase A with
umlaut"), these characters
- appear in the browser display as "?" (minus the quotes)
- also, if I save the page from the browser to disk and look at them
  with an iso-8859-1 capable editor, they are effectively "?".
(So it's not the browser misunderstanding them, it is Tomcat sending
them that way).
case b)
In one of the Tomcat startup files (e.g. tomcat_dir/bin/startup.sh or
even in /etc/init.d/tomcat5.5), I add the following line
LC_CTYPE="en_us.iso88591"
(or whatever is valid on that host to specify an iso-8859-1 LC_CTYPE)
(before the actual start of Tomcat)
and restart Tomcat.
Then the same page displays properly in the browser, and also is correct
iso-8859-1 when saved to disk and examined with the editor.
(In other words, what previously were "?" characters are now the correct
iso-8859-1 character bytes).
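One way the two cases can be reproduced in plain Java is sketched below. It rests on assumptions that cannot be verified without the webapp's code: that the bytes pass through a Java String somewhere, and that with no LC_CTYPE set the JVM falls back to an ASCII default charset (which JVMs of that era typically do under the C/POSIX locale). The class name LocaleRoundTrip and the method transcode are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LocaleRoundTrip {
    // Decodes the bytes with one charset and re-encodes them with another,
    // which is what effectively happens when bytes pass through a String.
    static byte[] transcode(byte[] in, Charset readAs, Charset writeAs) {
        return new String(in, readAs).getBytes(writeAs);
    }

    public static void main(String[] args) {
        // "A", uppercase A with umlaut (0xC4), "B", encoded as ISO-8859-1.
        byte[] page = { 'A', (byte) 0xC4, 'B' };

        // Case a (assumed): the bytes are decoded as US-ASCII. 0xC4 is not
        // valid ASCII, becomes U+FFFD, and is then written out as "?".
        byte[] caseA = transcode(page, StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1);

        // Case b (assumed): the bytes are decoded as ISO-8859-1 and survive.
        byte[] caseB = transcode(page, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);

        System.out.printf("case a: %02X %02X %02X%n", caseA[0] & 0xFF, caseA[1] & 0xFF, caseA[2] & 0xFF);
        System.out.printf("case b: %02X %02X %02X%n", caseB[0] & 0xFF, caseB[1] & 0xFF, caseB[2] & 0xFF);
    }
}
```

In this sketch, case a yields 41 3F 42 (a literal "?" byte in the middle) and case b yields 41 C4 42, which matches the symptoms described, though it does not prove that the webapp works this way.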
Now my question is :
How can it matter which LC_CTYPE Tomcat is started under, that would
have the result above ?
The behaviour above is consistent across different hosts, across the
same or different Tomcat versions, it is always the same webapp, always
the same html pages, always the same browser, etc. Only that LC_CTYPE
line changes the behaviour.
On the face of it, the only thing I can think of that would explain this
is that the webapp in question does something wrong, but what exactly
could it be doing ?
Any ideas ?
It is <%@ page pageEncoding="..." %> that is missing from those pages.
Thus the JSP compiler does not know what encoding they are using for
their source and mangles them at compilation time.
[...]
But these pages, as far as Tomcat and the webapp are concerned, are not
dynamic in any way. They are straight static html pages.
So is the JSP stuff relevant ?
(I'm genuinely asking, since I know nothing about JSP pages)
The static HTML pages, as well as all the other static files, are served
by the DefaultServlet. You should dig there. I think that the
fileEncoding initialization parameter of the servlet, as well as the
<mime-mapping> settings in web.xml, come into play.
JSP settings are irrelevant for them, of course.
Hi.
Thanks for the intent and answer above.
But I insist : these html pages are served by that webapp of which I am
talking, not by the DefaultServlet.
Those pages are being accessed via URLs like
http://myhost.mycompany.com/myservlet?..(additional parameters
indicating which static file to serve)..
It is on the way through that servlet that they get "corrupted", unless
I start Tomcat with LC_CTYPE="iso-8859-1".
That servlet, in its own web.xml config file in
tomcat_dir/webapps/myservlet/WEB-INF/web.xml, has no fileEncoding nor
mime-mapping section nor parameter.
So my question remains, I think : what could be going on in that servlet
so that :
- if LC_CTYPE is not set in the environment *of Tomcat* when it starts,
the upper iso-8859-1 characters in the pages are replaced by "?"
- if LC_CTYPE is set to "iso-8859-1" in the Tomcat environment when it
starts, then the pages delivered by the servlet are correct
?
I am not very qualified in Java, but could it be something like :
- the servlet reads those documents with some InputStream, without
specifying a character set or encoding, and by default that means to use
Tomcat's idea of its default LC_CTYPE for those InputStreams ?
- or the servlet outputs the document via an OutputStream without
specifying an encoding etc..
?
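The first hypothesis above can be illustrated in Java. Strictly speaking an InputStream only delivers raw bytes; it is the Reader wrapped around it (for example new InputStreamReader(in) or FileReader) that decodes with the JVM's default charset when none is given, and on JVMs of that era the default is derived from the locale environment such as LC_CTYPE. The sketch below is an assumption about what such a servlet might do, not its actual code; the names DefaultCharsetDemo and readAll are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Reads the whole stream as text, decoding with the given charset.
    static String readAll(InputStream in, Charset cs) {
        try (Reader r = new InputStreamReader(in, cs)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // What the JVM picked up from its environment at startup.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] page = { 'A', (byte) 0xC4, 'B' }; // ISO-8859-1 bytes

        // Equivalent to new InputStreamReader(in): result depends on the environment.
        String implicit = readAll(new ByteArrayInputStream(page), Charset.defaultCharset());
        // Explicit charset: result is the same under any locale.
        String explicit = readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);

        System.out.println("implicit decode correct: " + implicit.equals("A\u00C4B"));
        System.out.println("explicit decode correct: " + explicit.equals("A\u00C4B"));
    }
}
```

Running this once with and once without LC_CTYPE set in the environment should show whether the default charset changes on a given host. The same reasoning applies on the output side to OutputStreamWriter and PrintWriter created without a charset. Note that from JDK 18 onwards the default charset is UTF-8 regardless of locale, so the dependency described here applies to older JVMs.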
André