Daniel Henrique Alves Lima wrote:
IE is the best :-)
"Note: The accept-charset attribute does not work properly in Internet
Explorer. If accept-charset='ISO-8859-1', IE will send data encoded as
'Windows-1252'."
That is only one of the issues (browser inconsistencies).
If you want to really tackle this complex issue, you need to be
systematic, make sure you understand the bits and pieces, and do
everything right.
A short overview :
1) choose Unicode/UTF-8 as your charset/encoding, for *everything*.
Don't try to mix and match, or you'll get in trouble. Promise.
Applying #1 above :
2) find out the available "locales" on the Linux host where you run this
Tomcat.
"locale -a | more"
Pick one locale that has "utf8" in the name, note its name.
In the system script that starts Tomcat, add
export LC_ALL="pt_pt.u...@euro"
(or whichever locale you have chosen)
That sets the "system locale" for the JVM that runs Tomcat, and is a way
to make it independent from whatever may be the system's configured
"default locale".
3) All your html pages should have a declaration like :
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
4) All your html <form> tags should have an attribute :
accept-charset="UTF-8"
5) a URL is in no particular charset. A URL is *bytes*.
Generally speaking, any byte in a URL that is not an ASCII letter or
digit (a-zA-Z0-9), or one of a few permitted punctuation characters, is
encoded as %xy, where xy is the hexadecimal value of that byte.
After decoding these %xy things, the result is again bytes, and that's
how your application sees it.
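
To make the "a URL is bytes" point concrete, here is a minimal sketch in
Java. The class and method names (PercentBytes, encode, decode) are made
up for illustration, and the rule is deliberately simplified to "letters
and digits stay, everything else becomes %XY"; real URLs also leave a few
punctuation characters unescaped. It shows that the same character gives
different %xy sequences depending on which encoding produced the bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PercentBytes {
    // Simplified rule: ASCII letters and digits pass through,
    // every other byte is written as %XY (two hex digits).
    static String encode(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean plain = (v >= 'a' && v <= 'z')
                         || (v >= 'A' && v <= 'Z')
                         || (v >= '0' && v <= '9');
            if (plain) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    // Undo the %XY escapes. The result is bytes, not characters.
    // A '%' without two following characters is kept as a literal byte.
    static byte[] decode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write((byte) c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent
        System.out.println(encode(eAcute.getBytes(StandardCharsets.UTF_8)));      // %C3%A9
        System.out.println(encode(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // %E9
        System.out.println(decode("caf%C3%A9").length);                           // 5
    }
}
```

Nothing in "%C3%A9" or "%E9" says which encoding was used. That
information is simply not in the URL.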
6) In your application, you can decide to interpret this series of
bytes, as a string in the UTF-8 encoding, and decode it as such into
Unicode *characters*.
Forget about any parameters that claim to specify the charset of URLs;
they only confuse things.
The only way you know what was the underlying encoding, is when you know
for sure that all URLs that will hit your server, come from a known
source of which you controlled the encoding.
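
As a small illustration of why the choice in #6 matters, the sketch below
decodes the same five bytes twice, once as UTF-8 and once as ISO-8859-1.
The class name DecodeChoice and its helpers are hypothetical; the only
library calls are the standard String constructor and StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class DecodeChoice {
    // Interpret the raw bytes as UTF-8.
    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Interpret the very same bytes as ISO-8859-1 (one byte = one character).
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The bytes you get after undoing the escapes in "caf%C3%A9".
        byte[] raw = { 0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9 };
        System.out.println(asUtf8(raw).length());   // 4 characters: c a f U+00E9
        System.out.println(asLatin1(raw).length()); // 5 characters: c a f U+00C3 U+00A9
    }
}
```

Both results are valid strings as far as Java is concerned. Only one of
them is what the sender meant, and only you can know which.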
7) When submitting the values of the <input> tags of a form, browsers
will generally respect the basic encoding of the html page in which the
form was included, and (usually) also the "accept-charset" attribute.
By specifying both, you almost always win, as long as the submitted form
comes from your application, and has the right encoding.
8) In theory, you should also make sure that all responses sent by your
server to a browser, if they are html pages, contain the correct HTTP
header :
Content-type: text/html; charset=UTF-8
That, you can check with a browser add-on such as
- Live HTTP Headers for Firefox
- Fiddler2 for IE
and examine what goes out and what comes in.
You can also use Wireshark.
The good news is that most webservers do this correctly.
The bad news is that IE usually ignores this header, and makes its own
decision as to what the content is. Newer IE versions may be better.
9) Java strings are Unicode internally (UTF-16).
So when you do request.getParameter(), you will always get what Java
considers to be the proper Unicode translation of how the parameter came in.
The problem is to not let Java get confused about what it receives from
the browser. By doing all the above, you minimise the chances that it
will be confused.
10) If you want to really make sure, include in all your forms some
hidden input value, containing a known string with "accented" characters
(áàéèÜÖ and such).
Then, before you process any other parameter in your webapp, check if
that string matches one that you have defined in your servlet.
If it does not, then something is wrong.
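
A rough sketch of the check in #10, assuming the hidden field carries the
string áàéèÜÖ. The class name CharsetCanary and the method looksIntact
are invented for this example. In a servlet you would pass in whatever
request.getParameter() returns for your hidden field. The expected string
is written with \u escapes so that the encoding of the .java source file
cannot interfere with the comparison.

```java
import java.nio.charset.StandardCharsets;

class CharsetCanary {
    // The known string placed in the hidden form field, written with
    // Unicode escapes: a-acute, a-grave, e-acute, e-grave, U-umlaut, O-umlaut.
    static final String EXPECTED = "\u00e1\u00e0\u00e9\u00e8\u00dc\u00d6";

    // True only if the submitted value arrived exactly as it was sent.
    static boolean looksIntact(String submitted) {
        return EXPECTED.equals(submitted);
    }

    public static void main(String[] args) {
        // Simulate the classic failure: UTF-8 bytes decoded as ISO-8859-1.
        String mangled = new String(
                EXPECTED.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(looksIntact(EXPECTED)); // true
        System.out.println(looksIntact(mangled));  // false
    }
}
```

If the check fails, it is probably safer to reject or log the request
than to try to guess what the other parameters were meant to be.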