Marvin Addison wrote:
>> /can/ the servlet (or one of the filters) do anything that would cause
>> the value of "name1" to /not/ be a correct Java "TÜV" string in the
>> servlet ?
> Yes, absolutely. If this is a posted value and some filter makes a call
> that coerces the encoding of the request (e.g. request.getParameter()
> in the case of a POST), all subsequent filters and the servlet will see
> the string in the encoding used by that first filter. This is why it's
> important to set the encoding as early in the servlet processing
> pipeline as possible.
Thank you for the answer.
> For your particular case it's hard to imagine an encoding in practice
> that would make that string appear incorrectly. Both iso-8859-1 and
> utf-8 should handle Ü correctly.
I don't think that's true. A "Ü" in iso-8859-1 is a single byte (\xDC). In the UTF-8
encoding, it is 2 bytes (\xC3 \x9C). (The Unicode codepoint of "Ü" is 00DC (hex), but that's
a different matter.)
So if the servlet reads a parameter from the post, thinking the post is UTF-8 while it is
really iso-8859-1, and this parameter is a "Ü", the decoder sees \xDC, which in UTF-8
announces a two-byte sequence, so it consumes whichever byte follows it as well, and the
servlet gets garbage, because \xDC followed by that next byte is almost certainly not
valid UTF-8.
On the other hand, if the servlet reads a parameter from the post, thinking the post is
iso-8859-1 while it is really UTF-8, each byte is converted to one character: \xC3 becomes
the Java character with codepoint 00C3 (hex), which is a capital A tilde (can't even type
that on my German keyboard), and \x9C becomes a control character. So instead of one "Ü"
the servlet sees two wrong characters.
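The two mismatches can be reproduced outside of Tomcat with plain JDK calls; a
throw-away sketch (the class name is mine):

import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String u = "\u00DC"; // "Ü", as an escape so the source file encoding doesn't matter

        byte[] latin1 = u.getBytes(StandardCharsets.ISO_8859_1); // one byte:  0xDC
        byte[] utf8   = u.getBytes(StandardCharsets.UTF_8);      // two bytes: 0xC3 0x9C

        System.out.printf("iso-8859-1: %02X%n", latin1[0]);
        System.out.printf("utf-8     : %02X %02X%n", utf8[0], utf8[1]);

        // Body is really iso-8859-1 but is decoded as UTF-8: 0xDC announces a
        // two-byte sequence, the next byte doesn't fit, the result is the
        // replacement character.
        System.out.println(new String(latin1, StandardCharsets.UTF_8));

        // Body is really UTF-8 but is decoded as iso-8859-1: each byte becomes
        // its own character, "Ã" (U+00C3) followed by the control char U+009C.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));
    }
}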
In fact, this is what happens in reality :
We have a html page, defined as being content-type="text/html; charset=UTF-8".
It is saved as UTF-8, by a Unicode-savvy editor.
It is received by the browser, and the browser (IE or Firefox) says that the document is
UTF-8.
The page contains a <form> tag, which contains an enctype="UTF-8" attribute.
The form contains an input text box, in which the user types a "Ü" and then
submits the form.
In the normal configuration of the target webapp, there are
filter1
filter2
servlet
(in that order).
The servlet reads the POST parameters and gets garbage instead of the Java string "Ü".
If we remove filter1 and filter2, leaving the servlet alone, then the servlet reads the
proper "Ü".
If we re-instate filter1 and filter2, and in filter2 (the only piece whose code I control)
I add an early call to
request.setCharacterEncoding("UTF-8");
then the servlet gets the correct string.
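Which matches the "set it as early as possible" advice above. A minimal sketch, assuming
one is free to add a dedicated filter mapped in front of filter1 and filter2 (class and
init-parameter names are mine, this is not an existing Tomcat class):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Hypothetical "filter0": its only job is to fix the request encoding before
// anything downstream can trigger parsing of the POST body.
public class EarlyEncodingFilter implements Filter {

    private String encoding = "UTF-8";

    @Override
    public void init(FilterConfig config) throws ServletException {
        String configured = config.getInitParameter("encoding");
        if (configured != null) {
            encoding = configured;
        }
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Only override when the request did not declare a charset itself.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(encoding);
        }
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
        // nothing to release
    }
}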
Who is "responsible" for setting the request character set ? In my naive understanding, I
thought that whenever a method call happens which requires parsing the request body, and
if by that time the request encoding has not been set explicitly, it would be Tomcat code
which would evaluate the circumstances and set the encoding appropriately.
Such as :
- default is iso-8859-1 (as per HTTP default)
- but if the request somehow says otherwise (*), then whatever the request says.
((*) which for a POST it should always do, no ?)
Is that a wrong understanding ?
(I read the Servlet Spec v 3.0, section 3.10, but I am still not sure)
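To make (*) concrete: as far as I can tell, the container only honours an explicit charset
parameter in the request's Content-Type header, and browsers typically do not send one for
a form POST, so the iso-8859-1 default wins. A rough emulation of that rule as I read it
(my own throw-away code, not Tomcat's):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Rough emulation of my reading of the spec: take the charset from the
// Content-Type header if there is one, otherwise fall back to iso-8859-1.
// The container never looks at the page the form came from.
public class ContainerDefaultDemo {

    static Charset effectiveEncoding(String contentTypeHeader) {
        if (contentTypeHeader != null) {
            for (String part : contentTypeHeader.split(";")) {
                String p = part.trim().toLowerCase();
                if (p.startsWith("charset=")) {
                    return Charset.forName(p.substring("charset=".length()));
                }
            }
        }
        return StandardCharsets.ISO_8859_1; // the default the spec prescribes
    }

    public static void main(String[] args) {
        // What browsers typically send for a form POST: no charset at all.
        System.out.println(effectiveEncoding("application/x-www-form-urlencoded"));
        // Only an explicit charset changes the outcome.
        System.out.println(effectiveEncoding("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}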
filter2 contains calls, in that order, to
- config.getInitParameter
- optionally, for testing : request.setCharacterEncoding("UTF-8")
- request.getRequestURL
- request.getQueryString
- request.getRemoteAddr
- request.getHeaderNames
- request.getHeader
- request.getAttributeNames
.. and, finally, a
- request.getParameter
Is it then the responsibility of filter2 to set the request encoding ?
Should the optional request.setCharacterEncoding become mandatory ?
Should the request.setCharacterEncoding call be made just before the request.getParameter,
or is there an earlier method call in the list above that already needs the encoding to be
set before it is made ?
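A sketch of what I mean, with the call placed before the first call that parses the body
(class and variable names are mine; as far as I can tell, only getParameter(), getReader()
and getInputStream() touch the body, the other calls in the list do not):

import java.io.IOException;
import java.util.Enumeration;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Hypothetical reconstruction of filter2, with setCharacterEncoding() placed
// before the first call that parses the POST body.
public class Filter2 implements Filter {

    private String encoding;

    @Override
    public void init(FilterConfig config) throws ServletException {
        encoding = config.getInitParameter("encoding"); // e.g. "UTF-8"
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;

        // Must happen before the body is parsed; doing it this early is harmless.
        if (encoding != null && request.getCharacterEncoding() == null) {
            request.setCharacterEncoding(encoding);
        }

        // These only look at the request line, the headers and the attributes;
        // none of them reads the body, so none of them fixes the encoding.
        String url   = request.getRequestURL().toString();
        String query = request.getQueryString();
        String addr  = request.getRemoteAddr();
        Enumeration<String> headerNames    = request.getHeaderNames();
        Enumeration<String> attributeNames = request.getAttributeNames();

        // First call that parses the POST body: the encoding set above is the
        // one used here and in the servlet afterwards.
        String value = request.getParameter("name1");

        chain.doFilter(request, resp);
    }

    @Override
    public void destroy() {
        // nothing to release
    }
}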