Re: Char Encoding text streams on Tomcat 5.5 and Linux

André Warnier Wed, 02 Dec 2009 02:35:36 -0800

Hi.

Just a quick line: thank you for the test. I am not forgetting this, since I would really like to get to the bottom of it. I am fairly busy for the next 2 days, however, and will revisit that after the current rush. I have a definite case where I am forced to set Tomcat's startup locale to iso-8859-1, otherwise I am getting wrong encodings. But I need to review the exact characteristics of the case before continuing.



Christopher Schultz wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

On 11/30/2009 7:39 PM, André Warnier wrote:

Well, just make a simple test :
(I don't really know how to handle JSP pages, I only do servlets and
filters, otherwise I'd do it myself).

:)

-  create a simple html form with a UTF-8 charset, with a simple text
input box. Give it a method=POST.


Done:

<%@page language="Java" contentType="text/html" pageEncoding="UTF-8" %>
<html>
  <body>
<%
  if("POST".equals(request.getMethod())) {
%>
  <p>
    file.encoding: <%= System.getProperty("file.encoding") %><br />
    ContentType: <%= request.getContentType() %><br />
    Charset: <%= request.getCharacterEncoding() %><br />
  </p>
  <p>
    Received text from client: <%= request.getParameter("q") %>
  </p>
<%
  }
%>
    <form method="POST" accept-charset="UTF-8">
      <input name="q" type="text" value="<%= request.getParameter("q") %>" />

      <input type="submit" />
    </form>
  </body>
</html>

- then start Tomcat alternatively under a UTF-8 locale, then an
ISO-8859-1 locale, type some accented characters in your input box, and
submit the form.


Here's what I get when I submit your name, properly accented, into this
form:

"
file.encoding: UTF-8
ContentType: application/x-www-form-urlencoded
Charset: null

Received text from client: AndrÃ©
"

[Note that this is "Andr" followed by a capital "A" with a tilde (~) on
top of it, followed by a copyright symbol "(c)", as two separate characters.]

file.encoding is already UTF-8 and still this "bug" presents itself.

I tried re-starting Tomcat with LANG=en_US.ISO-8859-1 and I get this result:

"
file.encoding: ANSI_X3.4-1968
ContentType: application/x-www-form-urlencoded
Charset: null

Received text from client: AndrÃ©
"

The output is the same, except for the file.encoding which has changed:
Tomcat bones the interpretation of these strings in both situations.

Note that the client supplied no character encoding along with the
request, and that the form indicates that it accepts only UTF-8
encoding. Here, the client has screwed everything up by not including
the encoding of the form. Here's what gets sent over the wire:

Content-Type: application/x-www-form-urlencoded
Content-Length: 12
q=Andr%C3%A9

Notice that the bytes are Andr + 0xC3 0xA9
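
As a minimal sketch of how the urlencoded value above maps to those bytes, the following percent-decodes the value into raw bytes before any charset is applied. The class and method names are hypothetical, and the sketch assumes well-formed input (it does no error handling for a truncated "%" escape).

```java
import java.io.ByteArrayOutputStream;

class PercentDecodeSketch {
    /** Turns a urlencoded value into raw bytes; no charset is involved yet. */
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                // "%C3" becomes the single byte 0xC3
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                out.write(' ');   // '+' stands for a space in form encoding
            } else {
                out.write(c);     // unreserved ASCII character, sent as-is
            }
        }
        return out.toByteArray();
    }

    /** Formats bytes as lowercase hex pairs separated by single spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(percentDecode("Andr%C3%A9"))); // 41 6e 64 72 c3 a9
    }
}
```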

Let's see if we can manage to get that string of bytes some other way.

public class CharacterEncoding
{
    private static final char[] hex = "0123456789abcdef".toCharArray();

    public static String toByteString(byte[] a)
    {
        StringBuilder sb = new StringBuilder(a.length * 3);

        for(int i=0; i<a.length; ++i)
        {
            int high = (a[i] & 0xf0) >> 4;
            int low  = (a[i] & 0x0f);

            sb.append(hex[high]);
            sb.append(hex[low]);
            sb.append(' ');
        }

        return sb.toString();
    }

    public static void main(String[] args)
        throws Exception
    {
        String s = "André";

        System.out.println("Original string: " + s);
        System.out.println("UTF-8 bytes: " + toByteString(s.getBytes("UTF-8")));
        System.out.println("ISO-8859-1 bytes: " + toByteString(s.getBytes("ISO-8859-1")));
    }
}

The output of this program is:

Original string: André
UTF-8 bytes: 41 6e 64 72 c3 a9
ISO-8859-1 bytes: 41 6e 64 72 e9

You may recall that my web browser sent

41 6e 64 72 c3 a9

...which is the correct UTF-8 byte encoding of "André". The client is
using UTF-8 but omitting the charset from the Content-Type header, which
makes this a client problem IMO.

The only solution is to use a force-UTF-8-filter when the client fails
to provide a character encoding along with a request. It's an ugly hack
and I'm disappointed that the venerable Firefox still has this problem.
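
The core of such a filter might look like the sketch below. To keep it self-contained, a small hypothetical interface (EncodingAwareRequest) stands in for the two javax.servlet.ServletRequest methods it mirrors, getCharacterEncoding() and setCharacterEncoding(String); the class names and the in-memory FakeRequest are illustrative only, not code from this thread.

```java
// Hypothetical stand-in for the two ServletRequest methods used by the sketch.
interface EncodingAwareRequest {
    String getCharacterEncoding();            // null when the client sent no charset
    void setCharacterEncoding(String name);
}

class ForceUtf8Sketch {
    /** Applies UTF-8 only when the client declared nothing; returns the encoding in effect. */
    static String applyDefaultEncoding(EncodingAwareRequest request) {
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        return request.getCharacterEncoding();
    }

    /** In-memory request used only to exercise the logic above. */
    static final class FakeRequest implements EncodingAwareRequest {
        private String encoding;

        FakeRequest(String encoding) {
            this.encoding = encoding;
        }

        public String getCharacterEncoding() {
            return encoding;
        }

        public void setCharacterEncoding(String name) {
            this.encoding = name;
        }
    }

    public static void main(String[] args) {
        // No charset declared: the fallback is applied.
        System.out.println(applyDefaultEncoding(new FakeRequest(null)));
        // Charset declared by the client: left untouched.
        System.out.println(applyDefaultEncoding(new FakeRequest("ISO-8859-1")));
    }
}
```

In a real filter, this check would run in doFilter() before anything calls getParameter(), because setCharacterEncoding() has no effect once the request parameters have been read.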

Let's see what happens when we take the UTF-8 string and interpret it as
ISO-8859-1:

Bytes:      41 6e 64 72 c3 a9
UTF-8:       A  n  d  r     é   (note the é takes two bytes to express)
ISO-8859-1:  A  n  d  r  Ã  ©
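
The same comparison can be reproduced in a few lines of Java. This is only a sketch with hypothetical names; it uses java.nio.charset.StandardCharsets, which assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // The six bytes the browser sent for the parameter value.
    static final byte[] WIRE_BYTES = {0x41, 0x6e, 0x64, 0x72, (byte) 0xc3, (byte) 0xa9};

    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // UTF-8 folds 0xC3 0xA9 into one character, so the string has 5 chars.
        System.out.println(asUtf8(WIRE_BYTES).length());
        // ISO-8859-1 maps every byte to its own character, so it has 6 chars.
        System.out.println(asLatin1(WIRE_BYTES).length());
    }
}
```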

So, in the absence of any other information, Tomcat is receiving a byte
string and must interpret it according to the spec, which is to default
to ISO-8859-1 since there is no charset supplied with the Content-Type.
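
That defaulting rule can be sketched roughly as follows. The header parsing is deliberately simplified, the names are hypothetical, and this is not Tomcat's actual implementation.

```java
import java.util.Locale;

class CharsetDefaultSketch {
    // Fallback used when the request declares no charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or the default if absent. */
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String value = p.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // Header exactly as the browser sent it: no charset, so the default applies.
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        // Header with an explicit charset parameter.
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```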

I'm curious about the horrendous bug, because I have seen phenomena
like this.


See: not a bug in Tomcat. It's everyone else who's wrong :)

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAksVj90ACgkQ9CaO5/Lv0PA2gACgqnqziMA8J6qwF7RjgekT8YAh
Dz4AnRFTg95KN0VW7fVmKkxTaDgvDJ9R
=VUv9
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


