Hi, everybody. Thanks for the answers!

        Just to make myself clear:

        1. Always setting the request charset before doing anything else
fixes the bug;
        2. When the bug is "on", only input data (request) is wrong. Data
that was previously encoded as UTF-8 is rendered correctly (response). At
least, Firefox says that the pages were using UTF-8 as the encoding.
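
To illustrate point 2, here is a minimal standalone sketch (plain Java, no servlet API) of what I believe is happening: the browser sends UTF-8 bytes, and when no request charset has been set the container decodes them as ISO-8859-1. The class and method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class RequestCharsetDemo {
    // Hypothetical helper: decodes the raw bytes of a submitted value the
    // way a container would when no request charset has been set
    // (ISO-8859-1 is the servlet default).
    static String decodeWithDefault(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // The same bytes decoded as UTF-8, which is the effect of calling
    // request.setCharacterEncoding("UTF-8") before reading any parameter.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Jose" with an acute accent on the e, as sent from a UTF-8 page.
        byte[] raw = "Jos\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeWithDefault(raw).length()); // 5: one char per byte
        System.out.println(decodeAsUtf8(raw).length());      // 4: the original string
    }
}
```

The lengths are printed instead of the strings so that the console's own encoding cannot blur the result.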


        Andre:


On Wed, 2009-07-08 at 18:14 +0200, André Warnier wrote:

> > 
> That is only one of the issues (browser inconsistencies).

Inconsistencies? In Microsoft IE? Never! ;-)

> 
> If you want to really tackle this complex issue, you need to be 
> systematic, make sure you understand the bits and pieces, and do 
> everything right.
> A short overview :
> 
> 1) choose Unicode/UTF-8 as your charset/encoding, for *everything*. 
> Don't try to mix and match, or you'll get in trouble. Promise.

Checked.

> 
> Applying #1 above :
> 
> 2) find out the available "locales" on the Linux host where you run this 
> Tomcat.
> "locale -a | more"
> Pick one locale that has "utf8" in the name, note its name.
> In the system script that starts Tomcat, add
> export LC_ALL="pt_pt.u...@euro"
> (or whichever locale you have chosen)
> That sets the "system locale" for the JVM that runs Tomcat, and is a way 
> to make it independent from whatever may be the system's configured 
> "default locale".

I'll change every startup script to set this before Tomcat starts running.
Until now I have been using LANG=C or setting JVM system properties
directly (such as file.encoding, user.????, etc.).


> 
> 3) All your html pages should have a declaration like :
> <meta http-equiv="content-type" content="text/html; charset=UTF-8" />


Checked.

> 
> 4) All your html <form> tags should have an attribute :
> accept-charset="UTF-8"

I'll change the jsp files to include this.

> 
> 5) a URL is in no particular charset.  A URL is *bytes*.
> Any byte in a URL that is not, generally speaking, an ASCII letter or
> digit (a-zA-Z0-9) will be encoded as %xy, where xy is the hexadecimal
> representation of this byte.
> After decoding these %xy things, the result is again bytes, and that's 
> how your application sees it.

Ok. I think there is nothing like that in this webapp.
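
For reference, a small sketch of what point 5 describes, using java.net.URLEncoder from the JDK: the same character ends up as different %xy sequences depending on which bytes it was converted to first. The class and method names are only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodeDemo {
    // Converts the value to bytes in the given charset, then percent-encodes
    // every byte that is not a plain ASCII letter, digit or safe symbol.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u00e9"; // e with acute accent
        System.out.println(encode(value, "UTF-8"));      // %C3%A9 (two bytes)
        System.out.println(encode(value, "ISO-8859-1")); // %E9 (one byte)
        System.out.println(encode("abc123", "UTF-8"));   // abc123 (unchanged)
    }
}
```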

> 
> 6) In your application, you can decide to interpret this series of 
> bytes, as a string in the UTF-8 encoding, and decode it as such into 
> Unicode *characters*.
> Forget about any parameters to specify the charset of URLs, they only 
> confuse things totally.
> The only way you know what was the underlying encoding, is when you know 
> for sure that all URLs that will hit your server, come from a known 
> source of which you controlled the encoding.

?
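
If I read point 6 correctly, it amounts to something like the sketch below: the %xy escapes are turned back into bytes, and it is the application that chooses which charset to interpret those bytes in. This uses java.net.URLDecoder from the JDK; the class name and the sample value are assumptions for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentDecodeDemo {
    // Turns %xy escapes back into bytes, then reads those bytes as
    // characters in the charset chosen by the caller.
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "Jos%C3%A9"; // bytes 4A 6F 73 C3 A9
        // Read as UTF-8, C3 A9 is a single character (e with acute accent).
        System.out.println(decode(escaped, "UTF-8").length());      // 4
        // Read as ISO-8859-1, C3 and A9 become two separate characters.
        System.out.println(decode(escaped, "ISO-8859-1").length()); // 5
    }
}
```

The bytes are identical in both calls; only the interpretation differs, which is why the sender and the receiver have to agree on the encoding.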

> 
> 7) When submitting the values of the <input> tags of a form, browsers 
> will generally respect the basic encoding of the html page in which the 
> form was included, and (usually) also the "accept-charset" attribute.
> By specifying both, you almost always win, as long as the submitted form 
> comes from your application, and has the right encoding.

Ok.

> 
> 8) In theory, you should also make sure that all responses sent by your 
> server to a browser, if they are html pages, contain the correct HTTP 
> header :
> Content-type: text/html; charset=UTF-8
> That, you can check with a browser add-on such as
> - Live HTTP Headers for Firefox
> - Fiddler2 for IE
> and examine what goes out and what comes in.
> You can also use Wireshark.
> The good news is that most webservers do this correctly.
> The bad news is that IE usually ignores this header, and makes its own 
> decision as to what the content is.  Newer IE versions may be better.

Ok. Page properties (in Firefox) shows UTF-8 as the encoding.

> 
> 9) Java's internal charset is Unicode.
> So when you do request.getParameter(), you will always get what Java 
> considers to be the proper Unicode translation of how the parameter came in.
> The problem is to not let Java get confused about what it receives from 
> the browser.  By doing all the above, you minimise the chances that it 
> will be confused.

Ok.

> 
> 10) If you want to really make sure, include in all your forms some 
> hidden input value, containing a known string with "accented" characters 
> (áàéèÜÖ and such).
> Then, before you process any other parameter in your webapp, check if 
> that string matches one that you have defined in your servlet.
> If it does not, then something is wrong.
> 

Ok.
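
A rough sketch of how the check in point 10 might look, written as plain Java so it runs without a servlet container. The class name and the idea of passing in the received value as a String are my assumptions; in the webapp the value would come from request.getParameter() on a hidden field.

```java
import java.nio.charset.StandardCharsets;

public class CharsetSentinel {
    // Known accented string that would also be placed in a hidden form field.
    // Written with Unicode escapes so the encoding of the source file
    // itself cannot alter it.
    static final String EXPECTED = "\u00e1\u00e0\u00e9\u00e8\u00dc\u00d6";

    // True only if the received value survived the round trip intact.
    // A null (missing) parameter also counts as a failure.
    static boolean decodedCorrectly(String received) {
        return EXPECTED.equals(received);
    }

    public static void main(String[] args) {
        // Simulate the browser sending the sentinel as UTF-8 bytes.
        byte[] sent = EXPECTED.getBytes(StandardCharsets.UTF_8);
        String good = new String(sent, StandardCharsets.UTF_8);
        String bad = new String(sent, StandardCharsets.ISO_8859_1);
        System.out.println(decodedCorrectly(good)); // true
        System.out.println(decodedCorrectly(bad));  // false
    }
}
```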


