-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

On 11/30/2009 7:39 PM, André Warnier wrote:
> Well, just make a simple test :
> (I don't really know how to handle JSP pages, I only do servlets and
> filters, otherwise I'd do it myself).

:)

> - create a simple html form with a UTF-8 charset, with a simple text
> input box. Give it a method=POST.

Done:

<%@page language="Java" contentType="text/html" pageEncoding="UTF-8" %>
<html>
  <body>
<% if("POST".equals(request.getMethod())) { %>
    <p>
      file.encoding: <%= System.getProperty("file.encoding") %><br />
      ContentType: <%= request.getContentType() %><br />
      Charset: <%= request.getCharacterEncoding() %><br />
    </p>
    <p>
      Received text from client: <%= request.getParameter("q") %>
    </p>
<% } %>
    <form method="POST" accept-charset="UTF-8">
      <input name="q" type="text" value="<%= request.getParameter("q") %>" />
      <input type="submit" />
    </form>
  </body>
</html>

> - then start Tomcat alternatively under a UTF-8 locale, then an
> ISO-8859-1 locale, type some accented characters in your input box, and
> submit the form.

Here's what I get when I submit your name, properly accented, into this
form:

"
file.encoding: UTF-8
ContentType: application/x-www-form-urlencoded
Charset: null

Received text from client: AndrÃ©
"

[note that this is "Andr" followed by a capital "A" with a tilde (~) on
top of it, followed by a copyright symbol "(c)", as two separate
characters].

file.encoding is already UTF-8 and still this "bug" presents itself.

I tried re-starting Tomcat with LANG=en_US.ISO-8859-1 and I get this result:

"
file.encoding: ANSI_X3.4-1968
ContentType: application/x-www-form-urlencoded
Charset: null

Received text from client: AndrÃ©
"

The output is the same, except for the file.encoding which has changed:
Tomcat botches the interpretation of these strings in both situations.

Note that the client supplied no character encoding along with the
request, and that the form indicates that it accepts only UTF-8 encoding.

Here, the client has screwed everything up by not including the encoding
of the form. Here's what gets sent over the wire:

Content-Type: application/x-www-form-urlencoded
Content-Length: 12

q=Andr%C3%A9

Notice that the bytes are Andr + 0xC3 0xA9

Let's see if we can manage to get that string of bytes some other way.

public class CharacterEncoding
{
    private static final char[] hex = "0123456789abcdef".toCharArray();

    // Formats each byte as two lowercase hex digits followed by a space.
    public static String toByteString(byte[] a)
    {
        StringBuilder sb = new StringBuilder(a.length * 3);

        for(int i=0; i<a.length; ++i)
        {
            int high = (a[i] & 0xf0) >> 4;
            int low = (a[i] & 0x0f);

            sb.append(hex[high]);
            sb.append(hex[low]);
            sb.append(' ');
        }

        return sb.toString();
    }

    public static void main(String[] args)
        throws Exception
    {
        String s = "André";

        System.out.println("Original string: " + s);
        System.out.println("UTF-8 bytes: "
                           + toByteString(s.getBytes("UTF-8")));
        System.out.println("ISO-8859-1 bytes: "
                           + toByteString(s.getBytes("ISO-8859-1")));
    }
}

The output of this program is:

Original string: André
UTF-8 bytes: 41 6e 64 72 c3 a9
ISO-8859-1 bytes: 41 6e 64 72 e9

You may recall that my web browser sent 41 6e 64 72 c3 a9

...which is the correct UTF-8 byte encoding of "André". The client is
using UTF-8 but leaving the charset off the Content-Type, which makes
this a client problem IMO.

The only solution is to use a force-UTF-8-filter when the client fails
to provide a character encoding along with a request. It's an ugly hack
and I'm disappointed that the venerable Firefox still has this problem.

Let's see what happens when we take the UTF-8 string and interpret it as
ISO-8859-1:

Bytes:      41 6e 64 72 c3 a9
UTF-8:      A  n  d  r  é      (note the é takes two bytes to express)
ISO-8859-1: A  n  d  r  Ã  ©

So, in the absence of any other information, Tomcat is receiving a byte
string and must interpret it according to the spec, which is to default
to ISO-8859-1 since there is no charset supplied with the Content-Type.

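For what it's worth, the decoding step can be reproduced outside of
Tomcat. The following is only a sketch, not Tomcat's actual code: the
class FormDecodingDemo and its helper methods are made up for
illustration, and java.net.URLDecoder stands in for the container's
parameter parsing. It decodes the form value from the request body shown
earlier under both charsets, then shows that the garbled result still
carries the original bytes and can be recovered.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the servlet API.
final class FormDecodingDemo {

    // Percent-decodes a form value, interpreting the resulting bytes
    // in the given charset.
    static String decode(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Should not happen for a Charset the JVM already provided.
            throw new IllegalStateException(e);
        }
    }

    // Undoes the mix-up: turn the wrongly decoded characters back into
    // the bytes they came from, then decode those bytes as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Lists each char as U+XXXX so the result does not depend on the
    // console's own encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String wire = "Andr%C3%A9"; // value of "q" as sent by the browser

        String asUtf8 = decode(wire, StandardCharsets.UTF_8);
        String asLatin1 = decode(wire, StandardCharsets.ISO_8859_1);

        System.out.println("As UTF-8:      " + codePoints(asUtf8));
        System.out.println("As ISO-8859-1: " + codePoints(asLatin1));
        System.out.println("Repaired:      " + codePoints(repair(asLatin1)));
    }
}
```

If the garbage you see follows this two-characters-per-accent pattern,
the body was UTF-8 read as ISO-8859-1. Calling
request.setCharacterEncoding("UTF-8") before the first getParameter()
call, which is what a force-UTF-8 filter does, avoids it.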
> I'm curious about the horrendous bug, because I have seen phenomenons
> like this.

See: not a bug in Tomcat. It's everyone else who's wrong :)

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksVj90ACgkQ9CaO5/Lv0PA2gACgqnqziMA8J6qwF7RjgekT8YAh
Dz4AnRFTg95KN0VW7fVmKkxTaDgvDJ9R
=VUv9
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org