Daniel Henrique Alves Lima wrote

On Wed, 2009-07-08 at 18:14 +0200, André Warnier wrote:


6) In your application, you can decide to interpret this series of bytes as a string in the UTF-8 encoding, and decode it as such into Unicode *characters*. Forget about any parameters to specify the charset of URLs; they only confuse things. The only way to know the underlying encoding is to know for sure that all URLs that hit your server come from a known source whose encoding you controlled.

?

To use an example :

Suppose you give me the URL to your webapp, and it is
http://your-server.somewhere.br/yourapp

Suppose I use this URL and add a query string, so that it arrives at your server as a GET request for
/yourapp?param=%45abcd%f3%b9123%c4%20xy

Then, after URL-decoding the above into a series of bytes, you have absolutely no way to know under which character set I actually composed that query string.

It /could be/ that the sequence %c4%20 that you see above is actually the UTF-8 encoding of a single Unicode character.(**)

But it could also be that it is in fact the two iso-8859-1 characters "Ä" and "space". And it could also be that, together with the "x" which follows, it is the three-byte encoding of the Klingon symbol for breakfast.(*)

In order to decide on an interpretation of that query string using a certain character set and encoding, you would have to know something about me and my browser, which on the WWW you don't.
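
To make the ambiguity concrete, here is a minimal Java sketch (the class and method names are made up for this illustration, not part of Tomcat). It percent-decodes the query value above into raw bytes, then tries two interpretations. ISO-8859-1 accepts any byte sequence, so it always "works"; a strict UTF-8 decoder happens to reject these particular bytes, which rules out one guess but says nothing about which of the remaining charsets was meant.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class QueryBytesDemo {

    // Percent-decode into raw bytes. No charset is assumed at this stage.
    // Simplified for illustration: '+' is left alone, and a malformed
    // escape such as "%zz" makes Integer.parseInt throw.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c); // plain ASCII character, one byte
            }
        }
        return out.toByteArray();
    }

    // True only if the bytes form a well-formed UTF-8 sequence.
    // A fresh CharsetDecoder reports malformed input instead of replacing it.
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = percentDecode("%45abcd%f3%b9123%c4%20xy");

        // Guess 1: ISO-8859-1. Every byte maps to a character, so this never fails.
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1));

        // Guess 2: UTF-8. These bytes are not well-formed UTF-8, so this prints false.
        System.out.println(isStrictUtf8(raw));
    }
}
```

Note that a validity check like this can only eliminate candidates. Had the bytes been valid UTF-8, they would still also be valid iso-8859-1, and the server would be guessing either way.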

The only way you could /assume/ a certain character set and encoding would be if this request could only originate from a page that your application sent to my browser beforehand, in which you have done your best to ensure that whatever "click" I make results in a request with a known charset and encoding.
That's why all the previous details are important.

Note that some people variously assume that an HTTP URL is necessarily expressed in US-ASCII, or iso-latin-1, or UTF-8.
They are generally mistaken, as per
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4
http://www.apps.ietf.org/rfc/rfc3986.html

So, let me add an item to the previous shortlist :

11) In HTML <form> elements, always specify the attribute
method="POST"
This way, form input elements will be passed in the /body/ of the HTTP request (and not in the URL, as in my GET example above). At least for the body of an HTTP request, the browser can, and /should/, send charset/encoding information allowing the server to know how the submitted parameters are encoded.
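
As a rough sketch of what the server side of point 11 can look like, the plain-Java class below (hypothetical names, no servlet API, written only for this example) reads the charset parameter from a Content-Type header value when one is present and uses it to decode a form value from the request body. The fallback charset is an assumption you have to supply yourself, typically the charset of the page you served the form in, because the header parameter may be missing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;

class FormBodyCharset {

    // Returns the charset parameter of a Content-Type value, or null if absent.
    // Example input: "application/x-www-form-urlencoded; charset=UTF-8"
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String v = p.substring("charset=".length()).trim();
                // The value may be quoted: charset="UTF-8"
                if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
                    v = v.substring(1, v.length() - 1);
                }
                return v.isEmpty() ? null : v;
            }
        }
        return null;
    }

    // Decodes one URL-encoded form value from the body, using the declared
    // charset if there is one and the caller's fallback otherwise.
    static String decodeValue(String rawValue, String contentType, String fallback) {
        String declared = charsetOf(contentType);
        String charset = (declared != null) ? declared : fallback;
        try {
            return URLDecoder.decode(rawValue, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Charset declared by the client: it is used as-is.
        System.out.println(decodeValue("caf%C3%A9",
                "application/x-www-form-urlencoded; charset=UTF-8", "ISO-8859-1"));
        // No charset declared: the fallback decides the interpretation.
        System.out.println(decodeValue("caf%E9",
                "application/x-www-form-urlencoded", "ISO-8859-1"));
    }
}
```

In a real webapp you would normally let the container do this parsing rather than do it by hand; the point is only that for a POST body there is a defined place to look for the charset, which a GET query string does not have.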

There seems to be a recent /tendency/ for browsers to use UTF-8 for encoding request URLs, but it is by no means a universal thing yet. (In IE, for instance, it is a setting that must be turned on in "Internet Options".)


(*) This is a little-known fact, but there exists a Klingon relay station on Earth connected to our Internet, and the Klingons in their spaceships use it from time to time to access Wikipedia and have a good laugh. Their keyboards and browsers are different from ours, of course.

(**) and I bet someone is going to get back here and say that this cannot possibly be a valid UTF-8 sequence.

