Daniel Henrique Alves Lima wrote
On Wed, 2009-07-08 at 18:14 +0200, André Warnier wrote:
6) In your application, you can decide to interpret this series of
bytes, as a string in the UTF-8 encoding, and decode it as such into
Unicode *characters*.
Forget about any parameters to specify the charset of URLs, they only
confuse things totally.
The only way to know the underlying encoding is to know for sure that
all URLs that will hit your server come from a known source whose
encoding you control.
?
To use an example :
Suppose you give me the URL to your webapp, and it is
http://your-server.somewhere.br/yourapp
Suppose I use this URL, and add a query string, so that it arrives at
your server as a GET request for
/yourapp?param=%45abcd%f3%b9123%c4%20xy
then, you have absolutely no way, after URL-decoding the above into a
series of bytes, to know under which character set I actually composed
that query string.
It /could be/, that the sequence %c4%20 that you see above, is actually
the UTF-8 encoding of a single Unicode character.(**)
But it could also be that in fact it is the two iso-8859-1 characters
"Ä" and "space".
And it could also be that, together with the "x" which follows, it is
the tri-byte encoding of the Klingon symbol for breakfast.(*)
In order to decide on an interpretation of that query string using a
certain character set and encoding, you would have to know something
about me and my browser, which on the WWW you don't know.
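To make the ambiguity concrete, here is a minimal sketch in Java (7 or later is assumed, for StandardCharsets). The class name AmbiguousQuery and both helper methods are invented for this illustration and are not part of Tomcat or the servlet API. It URL-decodes the example value into bytes, then applies two different charsets to the same bytes:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AmbiguousQuery {

    // Percent-decode into raw bytes only; no charset is applied here.
    // (Simplified: '+' is not treated as a space, and input is assumed ASCII.)
    public static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write((byte) c);
            }
        }
        return out.toByteArray();
    }

    // Strict check: true only if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = percentDecode("%45abcd%f3%b9123%c4%20xy");

        // Same bytes, two interpretations; nothing in the request says which is right.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(raw, StandardCharsets.UTF_8); // malformed bytes are replaced

        System.out.println("ISO-8859-1 : " + asLatin1);
        System.out.println("UTF-8      : " + asUtf8);
        System.out.println("valid UTF-8: " + isValidUtf8(raw));
    }
}
```

Every byte sequence is a valid iso-8859-1 string, so that decoding never fails, and "it decoded without error" proves nothing about what the sender meant. The strict UTF-8 check can at best rule UTF-8 out; it cannot tell you which other charset was intended.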
The only way you could /assume/ a certain character set and encoding,
would be if this request could only originate from a page that your
application sent to my browser beforehand, in which you have done your
best to ensure that whatever "click" results in a request with a known
charset and encoding.
That's why all the previous details are important.
Note that some people variously assume that an HTTP URL is necessarily
expressed in US-ASCII, or iso-latin-1, or UTF-8.
They are generally mistaken, as per
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4
http://www.apps.ietf.org/rfc/rfc3986.html
So, let me add an item to the previous shortlist :
11) In HTML <form> elements, always specify the attribute
method="POST"
This way, form input elements will be passed in the /body/ of the HTTP
request (and not in the URL, like in my GET example above).
At least for the body of an HTTP request, the browser can, and /should/
send charset/encoding information allowing the server to know how the
submitted parameters are encoded.
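As a rough illustration of what the server can do with that information, the sketch below reads the charset parameter from a Content-Type header value and uses it to decode one form-encoded value. The class FormBodyCharset and its methods are hypothetical names made up for this example. The iso-8859-1 fallback is an assumption that mirrors the traditional servlet default, and a browser may well omit the charset parameter altogether:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;

public class FormBodyCharset {

    // Hypothetical helper: returns the charset parameter of a Content-Type
    // header value, or the given fallback when the header carries none.
    public static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    // Decodes one form-encoded value using the charset announced by the client.
    // Assumption: fall back to ISO-8859-1 when no charset was announced.
    public static String decodeValue(String encoded, String contentType) {
        try {
            return URLDecoder.decode(encoded, charsetOf(contentType, "ISO-8859-1"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset in Content-Type", e);
        }
    }

    public static void main(String[] args) {
        String withCharset = "application/x-www-form-urlencoded; charset=UTF-8";
        String withoutCharset = "application/x-www-form-urlencoded";

        System.out.println(decodeValue("caf%C3%A9", withCharset));
        System.out.println(decodeValue("caf%E9", withoutCharset));
    }
}
```

Inside a servlet you would not normally parse the header yourself; ServletRequest.getCharacterEncoding() and setCharacterEncoding() cover the same ground. The point here is only that, for a request body, there is a header that can carry this information, whereas a URL has none.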
There seems to be a recent /tendency/ for browsers to use UTF-8 for
encoding request URLs, but it is by no means yet a universal thing.
(In IE for instance, it is a setting that must be turned on in "Internet
Options").
(*) This is a little-known fact, but there exists in fact a Klingon
relay station on Earth connected to our Internet, and the Klingons in
their spaceships use it from time to time to access Wikipedia and have a
good laugh. Their keyboards and browsers are different from ours of course.
(**) and I bet someone is going to get back here and say that this
cannot possibly be a valid UTF-8 sequence.