Marvin Addison wrote:
>> /can/ the servlet (or one of the filters) do anything that would cause
>> the value of "name1" to /not/ be a correct Java "TÜV" string in the
>> servlet ?
> Yes, absolutely. If this is a posted value and some filter makes a call
> that coerces the encoding of the request (e.g. request.getParameter()
> in the case of a POST), all subsequent filters and the servlet will see
> the string in the encoding used by that first filter. This is why it's
> important to set the encoding as early in the servlet processing
> pipeline as possible.
Thank you for the answer.
> For your particular case it's hard to imagine an encoding in practice
> that would make that string appear incorrectly. Both iso-8859-1 and
> utf-8 should handle Ü correctly.
I don't think that's true. A "Ü" in iso-8859-1 is a single byte (\xDC). In the UTF-8
encoding, it is 2 bytes (\xC3 \x9C). (The Unicode codepoint of "Ü" is 00DC (hex), but that's
a different matter.)
So if the servlet reads a parameter from the post, thinking the post is UTF-8 while it is
really iso-8859-1, and this parameter is a "Ü", the decoder sees \xDC, which in UTF-8
announces a two-byte sequence, so it consumes whichever byte follows it as well, and the
servlet gets garbage, because \xDC followed by that next byte is almost certainly not
valid UTF-8.
On the other hand, if the servlet reads a parameter from the post, thinking the post is
iso-8859-1 while it is really UTF-8, each byte is converted to one character: \xC3 becomes
the Java character with codepoint 00C3 (hex), which is a capital A tilde (can't even type
that on my German keyboard), and \x9C becomes a control character. So instead of one "Ü"
the servlet sees two wrong characters.
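The two mismatches can be reproduced outside of Tomcat with plain JDK calls; a
throw-away sketch (the class name is mine):

import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String u = "\u00DC"; // "Ü", as an escape so the source file encoding doesn't matter

        byte[] latin1 = u.getBytes(StandardCharsets.ISO_8859_1); // one byte:  0xDC
        byte[] utf8   = u.getBytes(StandardCharsets.UTF_8);      // two bytes: 0xC3 0x9C

        System.out.printf("iso-8859-1: %02X%n", latin1[0]);
        System.out.printf("utf-8     : %02X %02X%n", utf8[0], utf8[1]);

        // Body is really iso-8859-1 but is decoded as UTF-8: 0xDC announces a
        // two-byte sequence, the next byte doesn't fit, the result is the
        // replacement character.
        System.out.println(new String(latin1, StandardCharsets.UTF_8));

        // Body is really UTF-8 but is decoded as iso-8859-1: each byte becomes
        // its own character, "Ã" (U+00C3) followed by the control char U+009C.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));
    }
}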
In fact, this is what happens in reality :
We have a html page, defined as being content-type="text/html; charset=UTF-8".
It is saved as UTF-8, by a Unicode-savvy editor.
It is received by the browser, and the browser (IE or Firefox) says that the document is
UTF-8.
The page contains a <form> tag, which contains an enctype="UTF-8" attribute.
The form contains an input text box, in which the user types a "Ü" and then
submits the form.
In the normal configuration of the target webapp, there are
filter1
filter2
servlet
(in that order).
The servlet reads the POST parameters and gets garbage instead of the Java string "Ü".
If we remove filter1 and filter2, leaving the servlet alone, then the servlet reads the
proper "Ü".
If we re-instate filter1 and filter2, and in filter2 (the only piece whose code I control)
I add an early call to
request.setCharacterEncoding("UTF-8");
then the servlet gets the correct string.
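Which matches the "set it as early as possible" advice above. A minimal sketch, assuming
one is free to add a dedicated filter mapped in front of filter1 and filter2 (class and
init-parameter names are mine, this is not an existing Tomcat class):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Hypothetical "filter0": its only job is to fix the request encoding before
// anything downstream can trigger parsing of the POST body.
public class EarlyEncodingFilter implements Filter {

    private String encoding = "UTF-8";

    @Override
    public void init(FilterConfig config) throws ServletException {
        String configured = config.getInitParameter("encoding");
        if (configured != null) {
            encoding = configured;
        }
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Only override when the request did not declare a charset itself.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(encoding);
        }
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
        // nothing to release
    }
}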
Who is "responsible" for setting the request character set ? In my naive understanding, I
thought that whenever a method call happens which requires parsing the request body, and
if by that time the request encoding has not been set explicitly, it would be Tomcat code
which would evaluate the circumstances and set the encoding appropriately.
Such as :
- default is iso-8859-1 (as per HTTP default)
- but if the request somehow says otherwise (*), then whatever the request says.
((*) which for a POST it should always do, no ?)
Is that a wrong understanding ?
(I read the Servlet Spec v 3.0, section 3.10, but I am still not sure)
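To make (*) concrete: as far as I can tell, the container only honours an explicit charset
parameter in the request's Content-Type header, and browsers typically do not send one for
a form POST, so the iso-8859-1 default wins. A rough emulation of that rule as I read it
(my own throw-away code, not Tomcat's):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Rough emulation of my reading of the spec: take the charset from the
// Content-Type header if there is one, otherwise fall back to iso-8859-1.
// The container never looks at the page the form came from.
public class ContainerDefaultDemo {

    static Charset effectiveEncoding(String contentTypeHeader) {
        if (contentTypeHeader != null) {
            for (String part : contentTypeHeader.split(";")) {
                String p = part.trim().toLowerCase();
                if (p.startsWith("charset=")) {
                    return Charset.forName(p.substring("charset=".length()));
                }
            }
        }
        return StandardCharsets.ISO_8859_1; // the default the spec prescribes
    }

    public static void main(String[] args) {
        // What browsers typically send for a form POST: no charset at all.
        System.out.println(effectiveEncoding("application/x-www-form-urlencoded"));
        // Only an explicit charset changes the outcome.
        System.out.println(effectiveEncoding("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}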
filter2 contains calls, in that order, to
- config.getInitParameter
- optionally, for testing : request.setCharacterEncoding("UTF-8")
- request.getRequestURL
- request.getQueryString
- request.getRemoteAddr
- request.getHeaderNames
- request.getHeader
- request.getAttributeNames
.. and, finally, a
- request.getParameter
Is it then the responsibility of filter2 to set the request encoding ?
Should the optional request.setCharacterEncoding become mandatory ?
Should the request.setCharacterEncoding call be made just before the request.getParameter,
or is there an earlier method call in the list above that already needs the encoding to be
set before it is made ?
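A sketch of what I mean, with the call placed before the first call that parses the body
(class and variable names are mine; as far as I can tell, only getParameter(), getReader()
and getInputStream() touch the body, the other calls in the list do not):

import java.io.IOException;
import java.util.Enumeration;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Hypothetical reconstruction of filter2, with setCharacterEncoding() placed
// before the first call that parses the POST body.
public class Filter2 implements Filter {

    private String encoding;

    @Override
    public void init(FilterConfig config) throws ServletException {
        encoding = config.getInitParameter("encoding"); // e.g. "UTF-8"
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;

        // Must happen before the body is parsed; doing it this early is harmless.
        if (encoding != null && request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(encoding);
        }

        // These only look at the request line, the headers and the attributes;
        // none of them reads the body, so none of them fixes the encoding.
        String url   = request.getRequestURL().toString();
        String query = request.getQueryString();
        String addr  = request.getRemoteAddr();
        Enumeration<String> headerNames    = request.getHeaderNames();
        Enumeration<String> attributeNames = request.getAttributeNames();

        // First call that parses the POST body: the encoding set above is the
        // one used here and in the servlet afterwards.
        String value = request.getParameter("name1");

        chain.doFilter(request, resp);
    }

    @Override
    public void destroy() {
        // nothing to release
    }
}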