Re: Character set issue

Konstantin Kolinko Mon, 05 Dec 2011 16:03:27 -0800

2011/12/6 André Warnier <a...@ice-sa.com>:
> Marvin Addison wrote:
>>>
>>> /can/ the servlet (or one of the filters)
>>> do anything that would cause the value of "name1" to /not/ be a correct
>>> Java
>>> "TÜV" string in the servlet ?
>>
>>
>> Yes, absolutely.  If this is a posted value and some filter fires that
>> coerces the encoding (e.g. request.getParameter() in the case of POST)
>> of the request, all subsequent filters and the servlet will see the
>> string in the encoding of the first filter.  This is why it's
>> important to set the encoding as early in the servlet processing
>> pipeline as possible.
>
>
> Thank you for the answer.
>
>
>>
>> For your particular case it's hard to imagine an encoding in practice
>> that would make that string appear incorrectly.  Both iso-8859-1 and
>> utf-8 should handle Ü correctly.
>
>
> I don't think that's true.  A "Ü" in iso-8859-1 is a single byte (\xDC).  In
> Unicode/UTF-8 encoding, it is 2 bytes (\xC3 \x9C).  (The Unicode codepoint of
> "Ü" is 00DC (hex), but that's a different matter.)
>
> So if the servlet reads a parameter from the post, thinking the post is
> UTF-8 while it is really iso-8859-1, and this parameter is a "Ü", the
> servlet will read 2 bytes, getting \xDC and whichever byte follows it, and
> get garbage, because \xDC followed by any other byte is probably not valid
> UTF-8.
> On the other hand, if the servlet reads a parameter from the post, thinking
> the post is iso-8859-1 while it is really UTF-8, and this parameter is a
> "Ü", the servlet will treat each byte as a separate character: \xC3 will be
> converted to the Java Unicode character with codepoint 00C3 (hex), which is
> a capital A tilde (can't even type that on my German keyboard), and \x9C to
> the control character with codepoint 009C (hex).
>
> In fact, this is what happens in reality :
>
> We have a html page, defined as being content-type="text/html;
> charset=UTF-8".
> It is saved as UTF-8, by a Unicode-savvy editor.
> It is received by the browser, and the browser (IE or Firefox) says that the
> document is UTF-8.
> The page contains a <form> tag, which contains an enctype="UTF-8" attribute.
> The form contains an input text box, in which the user types a "Ü" and then
> submits the form.
>
> In the normal configuration of the target webapp, there are
> filter1
> filter2
> servlet
> (in that order).
> servlet reads the post parameters and the servlet gets garbage instead of
> the Java string "Ü".
>
> If we remove filter1 and filter2, leaving servlet alone, then servlet reads
> the proper "Ü".
>
> If we reinstate filter1 and filter2, and in filter2 (the only piece of
> which I control the code), I add an early call to
> request.setCharacterEncoding("UTF-8");
> then servlet gets the correct string.
>
> Who is "responsible" for setting the request character set ? In my naive
> understanding, I thought that whenever a method call happens which requires
> parsing the request body, and if by that time the request encoding has not
> been set explicitly, it would be Tomcat code which would evaluate the
> circumstances and set the encoding appropriately.
> Such as :
> - default is iso-8859-1 (as per HTTP default)
> - but if the request somehow says otherwise (*), then whatever the request
> says.
>  ((*) which for a POST it should always do, no ?)
>
> Is that a wrong understanding ?
> (I read the Servlet Spec v 3.0, section 3.10, but I am still not sure)
>
> filter2 contains calls, in that order, to
> - config.getInitParameter
> - optionally, for testing : request.setCharacterEncoding("UTF-8")
> - request.getRequestURL
> - request.getQueryString
> - request.getRemoteAddr
> - request.getHeaderNames
> - request.getHeader
> - request.getAttributeNames
> .. and, finally, a
> - request.getParameter
>
> Is it then the responsibility of filter2 to set the request encoding ?
> Should the optional request.setCharacterEncoding become mandatory ?
> Should the request.setCharacterEncoding call be made just before the
> request.getParameter, or is there another earlier method call in the list
> above that can trigger the encoding to be already set ?
>


Parameter parsing happens once and is triggered by the first call
that requests the parameters.
That call is usually request.getParameter(), but getParameterValues(),
getParameterNames() and getParameterMap() trigger it as well.

At _that_ moment the conversion from bytes to Strings happens and the
request encoding must already be set.
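
To illustrate why the order of calls matters, here is a minimal sketch in
plain Java. FakeRequest is a hypothetical stand-in written for this example;
it is not the servlet API and not Tomcat's implementation. It only models the
"parse once, on first access" behaviour described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LazyParamDemo {

    /** Hypothetical model of a request; NOT javax.servlet.ServletRequest. */
    static class FakeRequest {
        private final byte[] rawValue;   // parameter bytes as sent by the browser
        private Charset encoding = StandardCharsets.ISO_8859_1; // used if nothing is set
        private String parsed;           // stays null until the first getParameter()

        FakeRequest(byte[] rawValue) {
            this.rawValue = rawValue;
        }

        void setCharacterEncoding(String name) {
            // Has no effect on a value that was already parsed.
            encoding = Charset.forName(name);
        }

        String getParameter() {
            if (parsed == null) {
                // The bytes-to-String conversion happens here, exactly once.
                parsed = new String(rawValue, encoding);
            }
            return parsed;
        }
    }

    /** Encoding is set before anything reads the parameter. */
    static String setThenRead(byte[] body) {
        FakeRequest request = new FakeRequest(body);
        request.setCharacterEncoding("UTF-8");
        return request.getParameter();
    }

    /** An earlier component reads the parameter first; setting the encoding later is too late. */
    static String readThenSet(byte[] body) {
        FakeRequest request = new FakeRequest(body);
        request.getParameter();                 // e.g. a filter earlier in the chain
        request.setCharacterEncoding("UTF-8");  // no longer changes the parsed value
        return request.getParameter();
    }

    public static void main(String[] args) {
        byte[] body = "T\u00DCV".getBytes(StandardCharsets.UTF_8);
        System.out.println("set, then read: " + setThenRead(body));
        System.out.println("read, then set: " + readThenSet(body));
    }
}
```

In this model the second variant keeps the garbled value, which would match
the symptom of a filter earlier in the chain reading a parameter before
setCharacterEncoding() is called.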

It is the application's responsibility to set the request encoding. If
it is not set explicitly, it defaults to ISO-8859-1. (A charset given in
the Content-Type header of the request would be taken into account, but
most browsers do not include a charset in their requests, so in practice
that does not help.)
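
What the ISO-8859-1 default does to a UTF-8 encoded "Ü" can be checked
outside of any container. The following is a small sketch using only the JDK;
the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {

    /** Upper-case hex dump of a byte array, e.g. "C39C". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Bytes were sent as UTF-8 but are decoded with the ISO-8859-1 default. */
    static String utf8ReadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Bytes were sent as ISO-8859-1 but are decoded as UTF-8. */
    static String latin1ReadAsUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String u = "\u00DC"; // "Ü", written as an escape to avoid source-encoding issues

        System.out.println(hex(u.getBytes(StandardCharsets.ISO_8859_1))); // DC
        System.out.println(hex(u.getBytes(StandardCharsets.UTF_8)));      // C39C

        // One character becomes two: U+00C3 (A tilde) followed by U+009C.
        System.out.println(utf8ReadAsLatin1(u).length());

        // A lone 0xDC is not valid UTF-8; the decoder substitutes U+FFFD.
        System.out.println(latin1ReadAsUtf8(u).contains("\uFFFD"));
    }
}
```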

Note that there is a standard "SetCharacterEncodingFilter" in Tomcat 7.
(In 7.0 it is in the o.a.c.filters package; in 6.0 and 5.5 it is in the
examples webapp.)

Once again,
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

Best regards,
Konstantin Kolinko

