Re: UTF-8 handling differs between two servlets within the same application

Christopher Schultz Mon, 23 Jun 2008 10:54:12 -0700

André,


André Warnier wrote:
| I am sorry to butt in again, but are you *really* sure that the problem
| is not earlier in the chain than what you think ?
| I have read the article at the link given earlier :
| http://wiki.apache.org/tomcat/Tomcat/UTF-8
| and I am quite sure that what is said in that article is wrong, or at
| least incomplete.  The article seems to assume that whatever the browser
| sends is always iso-8859-1, and that at the server level you can then
| just go and "decode" it into utf-8.  That is wrong, I can assure you.

You're right: you can't just assume that the incoming data is UTF-8. The
problem is that browsers often do not include a charset in the
Content-Type header of their POST requests. They /should/, but sometimes
they do not. In these cases, the server is left to guess. Guessing is
hard, but most browsers act somewhat predictably...

| Browsers will send utf-8 if the right conditions are met, and you will
| corrupt that data if you force it through a second encoding/decoding.
| Browsers will also sometimes send iso-8859-1, if you are not careful or
| if the browser is buggy. It happens.  (iso-8859-1 is the default in
| HTTP, so if you do not specify things differently, that is what you'll
| get).

Most browsers will send request #1 in the same encoding that was used
for response #0. That is, if a page is encoded in UTF-8, then a form
submitted from that page will (unless otherwise specified) use the same
encoding -- even if that encoding is not named in the Content-Type
header.
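
To make the corruption André describes concrete, here is a minimal
sketch using only the JDK. It shows what happens when UTF-8 bytes are
decoded as ISO-8859-1, and why "repairing" a string that was already
decoded correctly damages it. The class and method names are made up
for this example.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {

    // Simulate a server that guesses ISO-8859-1 for bytes the browser sent as UTF-8.
    static String decodeAsLatin1(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Undo the wrong guess: recover the raw bytes, then decode them as UTF-8.
    // This only helps if the input really was mis-decoded in the first place.
    static String repair(String decoded) {
        byte[] wire = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Andr\u00e9"; // "André", 5 characters
        String garbled = decodeAsLatin1(name);
        System.out.println(garbled.length());             // 6: the e-acute became two characters
        System.out.println(repair(garbled).equals(name)); // true: the wrong guess is reversible
        System.out.println(repair(name).equals(name));    // false: "repairing" good data corrupts it
    }
}
```

The last line is the second encoding/decoding pass André warns about:
applying the fix-up unconditionally breaks every request that was
decoded correctly to begin with.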

| In an ideal world, when a browser sends a string parameter via a POST,
| each parameter value should be enclosed in a part with a header and a
| content. The header of the part should have a line
| Content-type: text/plain; charset=xxxxx
| and the content of that part should then be in that xxxx charset encoding.

"Parting" is not required here. You just encode the whole POST body with
the same encoding, and name that encoding in the standard Content-Type
header.
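
For illustration, a header that names the encoding looks like
"Content-Type: application/x-www-form-urlencoded; charset=UTF-8". The
sketch below pulls the charset out of such a value. It is a deliberately
simplified parser with a made-up name (charsetOf), not what any
particular container does, and it ignores the finer points of parameter
quoting.

```java
import java.util.Locale;

class ContentTypeCharset {

    // Return the charset parameter of a Content-Type value, or null if none is given.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip one pair of surrounding double quotes, if present.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("application/x-www-form-urlencoded"));                // null
    }
}
```

The null case is the one this thread is about: the browser said nothing,
so the server has to guess.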

Now, back to the server. No server should ever clobber an encoding
specified by the client. The filter example on that wiki page needs to
be fixed so that the encoding is only set if the request does not
already specify one. This is a BIG BUG in the filter shown on that page,
and someone should fix it (maybe I will... I just registered for the
Wiki).

If you /know/ that your pages are being sent in UTF-8 and you make the
reasonable assumption that requests with no Content-Type encoding will
use the encoding of the previous response, then the filter listed on the
aforementioned page is acceptable (again, with a check for an existing
Content-Type encoding).
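
A minimal sketch of that check follows. To keep it runnable without the
servlet API on the classpath, it uses a hypothetical Request interface
that mirrors the two ServletRequest methods involved
(getCharacterEncoding, which returns null when the request names no
encoding, and setCharacterEncoding). The names applyDefault and
FakeRequest are invented for the example; in a real filter the same
test would sit in doFilter before chain.doFilter is called.

```java
class DefaultEncodingSketch {

    // Hypothetical stand-in for the two ServletRequest methods used below.
    // (The real setCharacterEncoding also declares UnsupportedEncodingException.)
    interface Request {
        String getCharacterEncoding();
        void setCharacterEncoding(String encoding);
    }

    // Apply the fallback only when the client did not name an encoding.
    static void applyDefault(Request request, String fallback) {
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(fallback);
        }
    }

    // Minimal in-memory request, for illustration only.
    static final class FakeRequest implements Request {
        private String encoding;

        FakeRequest(String encoding) {
            this.encoding = encoding;
        }

        public String getCharacterEncoding() {
            return encoding;
        }

        public void setCharacterEncoding(String encoding) {
            this.encoding = encoding;
        }
    }

    public static void main(String[] args) {
        Request silent = new FakeRequest(null);
        Request explicit = new FakeRequest("ISO-8859-1");
        applyDefault(silent, "UTF-8");
        applyDefault(explicit, "UTF-8");
        System.out.println(silent.getCharacterEncoding());   // UTF-8
        System.out.println(explicit.getCharacterEncoding()); // ISO-8859-1
    }
}
```

The fallback ("UTF-8" here) is only as good as the assumption that your
own pages were served in that encoding.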

| It is quite possible that Tomcat's innards do not do things correctly
| when they decode a POST, and just deliver the raw parameter value as
| received.  But that would surprise me, and I would submit that it would
| then be a bug.

Tomcat does, in fact, decode the parameters properly. That's what the
setCharacterEncoding method does -- it sets the character encoding that
will be used to decode the request's body, including by any Reader used
to read it. It has to be called before the parameters or the Reader are
first accessed; beyond that, your code does not have to do anything
special.
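
As a rough illustration of why the chosen encoding matters at that
step, the sketch below percent-decodes one form value with the JDK's
URLDecoder under two different encodings. This is not Tomcat's actual
parameter-parsing code, and the names FormDecodeSketch and decodeValue
are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodeSketch {

    // Percent-decode a single form value using the given character encoding.
    static String decodeValue(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        String raw = "Andr%C3%A9"; // "André" as a UTF-8 browser would submit it
        System.out.println(decodeValue(raw, "UTF-8").length());      // 5
        System.out.println(decodeValue(raw, "ISO-8859-1").length()); // 6
    }
}
```

The same bytes give five characters under UTF-8 and six under
ISO-8859-1, which is why the encoding has to be settled before the
parameters are first read.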

-chris
