Re: request.setCharacterEncoding() && request.getParameter()

André Warnier Wed, 08 Jul 2009 09:14:59 -0700

Daniel Henrique Alves Lima wrote:

        IE is the best :-)


"Note: The accept-charset attribute does not work properly in Internet
Explorer. If accept-charset='ISO-8859-1', IE will send data encoded as
'Windows-1252'."

That is only one of the issues (browser inconsistencies).

If you want to really tackle this complex issue, you need to be systematic, make sure you understand the bits and pieces, and do everything right.

A short overview :

1) choose Unicode/UTF-8 as your charset/encoding, for *everything*. Don't try to mix and match, or you'll get in trouble. Promise.


Applying #1 above :

2) find out the available "locales" on the Linux host where you run this Tomcat.

"locale -a | more"
Pick one locale that has "utf8" in the name, note its name.
In the system script that starts Tomcat, add
export LC_ALL="pt_pt.u...@euro"
(or whichever locale you have chosen)

That sets the "system locale" for the JVM that runs Tomcat, and is a way to make it independent from whatever may be the system's configured "default locale".
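
To check that the locale setting actually reached the JVM, something like the following sketch can be run with the same environment as the Tomcat start script. The class name JvmCharsetCheck is made up for this example. It only prints what the JVM reports. On many JVM versions the default charset is derived from the locale, but that is an assumption to verify on your own setup.

```java
import java.nio.charset.Charset;

final class JvmCharsetCheck {
    // Build a one-line summary of the charset-related settings the JVM sees.
    static String describe() {
        return "default charset=" + Charset.defaultCharset().name()
             + ", file.encoding=" + System.getProperty("file.encoding")
             + ", LC_ALL=" + System.getenv("LC_ALL"); // null if LC_ALL is not exported
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If LC_ALL shows up as null, the export in the start script did not reach this process.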


3) All your html pages should have a declaration like :
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

4) All your html <form> tags should have an attribute :
accept-charset="UTF-8"

5) a URL is in no particular charset.  A URL is *bytes*.

Any byte in a URL that is not (generally speaking) an ASCII letter or digit (a-z, A-Z, 0-9) will be encoded as %xy, where xy is the hexadecimal representation of this byte.
After decoding these %xy things, the result is again bytes, and that's how your application sees it.
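
A small sketch of what this means in practice: the same character turns into different %xy sequences depending on which charset was used to turn it into bytes first. The class name PercentEncodingDemo and the sample value are made up for the example. java.net.URLEncoder implements the form-encoding variant (it writes a space as "+"), which is close enough to illustrate the point.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class PercentEncodingDemo {
    // Turn the characters into bytes using the given charset, then %-encode those bytes.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "café", written with an escape so the source file encoding does not matter
        System.out.println(encode(value, "UTF-8"));      // é becomes two bytes: caf%C3%A9
        System.out.println(encode(value, "ISO-8859-1")); // é becomes one byte:  caf%E9
    }
}
```

Nothing in "caf%E9" or "caf%C3%A9" says which charset produced it; the receiver only sees bytes.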

6) In your application, you can decide to interpret this series of bytes as a string in the UTF-8 encoding, and decode it as such into Unicode *characters*.
Forget about any parameters to specify the charset of URLs, they only confuse things totally.
The only way you know what the underlying encoding was, is when you know for sure that all URLs that will hit your server come from a known source of which you controlled the encoding.
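
As a sketch of that decision, here is the same series of bytes decoded once as UTF-8 and once as ISO-8859-1. The class name PercentDecodingDemo and the sample value are made up for the example. It assumes the sender really did encode as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PercentDecodingDemo {
    // Undo the %xy encoding, then interpret the resulting bytes in the given charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "caf%C3%A9"; // bytes 0xC3 0xA9 are "é" in UTF-8
        System.out.println(decode(raw, "UTF-8"));      // the intended 4 characters
        System.out.println(decode(raw, "ISO-8859-1")); // 5 characters: each byte read as its own character
    }
}
```

The second result is the typical "Ã©" garbage that shows up when the decoding side guesses a different charset than the sending side used.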

7) When submitting the values of the <input> tags of a form, browsers will generally respect the basic encoding of the html page in which the form was included, and (usually) also the "accept-charset" attribute.
By specifying both, you almost always win, as long as the submitted form comes from your application, and has the right encoding.

8) In theory, you should also make sure that all responses sent by your server to a browser, if they are html pages, contain the correct HTTP header :

Content-type: text/html; charset=UTF-8
That, you can check with a browser add-on such as
- LiveHttpHeader for Firefox
- Fiddler2 for IE
and examine what goes out and what comes in.
You can also use Wireshark.
The good news is that most webservers do this correctly.

The bad news is that IE usually ignores this header, and makes its own decision as to what the content is. Newer IE versions may be better.


9) Java's internal charset is Unicode.

So when you do request.getParameter(), you will always get what Java considers to be the proper Unicode translation of how the parameter came in.
The problem is to not let Java get confused about what it receives from the browser. By doing all the above, you minimise the chances that it will be confused.

10) If you want to really make sure, include in all your forms some hidden input value, containing a known string with "accented" characters (áàéèÜÖ and such).
Then, before you process any other parameter in your webapp, check if that string matches one that you have defined in your servlet.

If it does not, then something is wrong.
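
A minimal sketch of such a check, kept free of the servlet API so it can be run on its own. The class name EncodingSentinel is made up for the example. In a servlet, the received value would come from request.getParameter() on a hidden field of your choosing (the name "encodingCheck" in the comment is hypothetical).

```java
import java.nio.charset.StandardCharsets;

final class EncodingSentinel {
    // The known string, written with escapes so the source file encoding cannot alter it: áàéèÜÖ
    static final String EXPECTED = "\u00e1\u00e0\u00e9\u00e8\u00dc\u00d6";

    // True only if the hidden field arrived exactly as it was sent.
    // In a servlet: isIntact(request.getParameter("encodingCheck")), field name hypothetical.
    static boolean isIntact(String received) {
        return EXPECTED.equals(received);
    }

    public static void main(String[] args) {
        // Simulate a confused decoder: UTF-8 bytes interpreted as ISO-8859-1.
        String garbled = new String(EXPECTED.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(isIntact(EXPECTED)); // true
        System.out.println(isIntact(garbled));  // false
        System.out.println(isIntact(null));     // false (field missing)
    }
}
```

When the check fails, it is safer to reject or log the request than to process the other parameters, since they went through the same decoding.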





