Re: request.setCharacterEncoding() && request.getParameter()

André Warnier Wed, 08 Jul 2009 13:57:18 -0700

Daniel Henrique Alves Lima wrote

On Wed, 2009-07-08 at 18:14 +0200, André Warnier wrote:

6) In your application, you can decide to interpret this series of bytes, as a string in the UTF-8 encoding, and decode it as such into Unicode *characters*. Forget about any parameters to specify the charset of URLs, they only confuse things totally. The only way you know what was the underlying encoding, is when you know for sure that all URLs that will hit your server, come from a known source of which you controlled the encoding.
?

To use an example :

Suppose you give me the URL to your webapp, and it is
http://your-server.somewhere.br/yourapp

Suppose I use this URL, and add a query string, so that it arrives at your server as a GET request for

/yourapp?param=%45abcd%f3%b9123%c4%20xy

then, you have absolutely no way, after URL-decoding the above into a series of bytes, to know under which character set I actually composed that query string.

It /could be/, that the sequence %c4%20 that you see above, is actually the UTF-8 encoding of a single Unicode character. (**)

But it could also be that in fact it is the two iso-8859-1 characters "Ä" and "space". And it could also be that, together with the "x" which follows, it is the tri-byte encoding of the Klingon symbol for breakfast. (*)

In order to decide on an interpretation of that query string using a certain character set and encoding, you would have to know something about me and my browser, which on the WWW you don't know.
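To make the ambiguity concrete, here is a minimal sketch in plain Java (no servlet API). The class and method names are mine, invented for illustration; it only handles %xx escapes and assumes everything else in the string is plain ASCII. It first turns the example query value into bytes, then interprets the same bytes under two different charsets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tomcat or the servlet API.
public class QueryBytesDemo {

    // Turns %xx escapes into raw bytes; every other character is assumed
    // to be plain ASCII and is copied through as a single byte.
    // A malformed escape such as "%zz" makes parseInt throw NumberFormatException.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write((byte) c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = percentDecode("%45abcd%f3%b9123%c4%20xy");

        // Same bytes, two interpretations. Nothing in the request says which is right.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);

        System.out.println("iso-8859-1: " + asLatin1);
        System.out.println("UTF-8     : " + asUtf8);
        System.out.println("identical : " + asLatin1.equals(asUtf8));
    }
}
```

Under iso-8859-1 every byte maps to some character, so that decoding always "succeeds", whether or not it is what the sender meant. Under UTF-8 the bytes that do not form valid sequences come out as the replacement character U+FFFD, and the two results differ.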

The only way you could /assume/ a certain character set and encoding, would be if this request could only originate from a page that your application sent to my browser beforehand, in which you have done your best to ensure that whatever "click" results in a request with a known charset and encoding.

That's why all the previous details are important.

Note that some people variously assume that an HTTP URL is necessarily expressed in US-ASCII, or iso-latin-1, or UTF-8.

They are generally mistaken, as per
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4
http://www.apps.ietf.org/rfc/rfc3986.html

So, let me add an item to the previous shortlist :

11) in html <form> elements, always specify the attribute
method="POST"

This way, form input elements will be passed in the /body/ of the HTTP request (and not in the URL, like in my GET example above). At least for the body of an HTTP request, the browser can, and /should/ send charset/encoding information allowing the server to know how the submitted parameters are encoded.
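As a rough sketch of what the server can do with that information: read the charset parameter from the Content-Type header of the POST, if the browser sent one, and decode the body parameters with it. This is plain Java, not the servlet API; the class name, the method names and the fallback choice are assumptions of mine for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class BodyCharsetSketch {

    // Returns the charset named in a Content-Type header value, or the
    // fallback when the header is missing, carries no charset parameter,
    // or names a charset this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type
            String p = parts[i].trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Decodes one form-urlencoded value using the given charset.
    static String decodeValue(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // cannot happen for an existing Charset
        }
    }

    public static void main(String[] args) {
        String header = "application/x-www-form-urlencoded; charset=UTF-8";
        Charset cs = charsetOf(header, StandardCharsets.ISO_8859_1);
        System.out.println(cs + " -> " + decodeValue("Andr%C3%A9", cs));
    }
}
```

When the header carries no charset, the sketch falls back to whatever you pass in, which puts you back in the "assume" situation described above. In a servlet, that is the case where calling request.setCharacterEncoding() before the first request.getParameter() lets the application supply its own choice.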

There seems to be a recent /tendency/ for browsers to use UTF-8 for encoding request URLs, but it is by no means yet a universal thing. (In IE for instance, it is a setting that must be turned on in "Internet Options").

(*) This is a little-known fact, but there exists in fact a Klingon relay station on Earth connected to our Internet, and the Klingons in their spaceships use it from time to time to access Wikipedia and have a good laugh. Their keyboards and browsers are different from ours of course.

(**) and I bet someone is going to get back here and say that this cannot possibly be a valid UTF-8 sequence.
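For whoever takes that bet: a strict decoder settles it. The sketch below (plain Java, class and method names made up for the example) asks the JDK's UTF-8 decoder to report malformed input instead of silently replacing it. The byte 0xC4 starts a two-byte sequence and must be followed by a continuation byte in the range 0x80 to 0xBF, which 0x20 is not.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class StrictUtf8Check {

    // True only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // at least one byte sequence is not valid UTF-8
        }
    }

    public static void main(String[] args) {
        // %c4%20 from the example query string
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC4, 0x20}));
        // %c3%a9, the UTF-8 encoding of "é", for comparison
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xA9}));
    }
}
```

Note that this only works in one direction: bytes that fail the check are certainly not UTF-8, but bytes that pass it could still have been composed in another charset.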


