Mike Wilson wrote:
Hi Chris,
I'm aware of the two levels of encoding but I'm wondering whether
servlet specification writers were :-)
Here are two examples from Tomcat 7 running with URIEncoding="UTF-8".
Example 1: path /ä in URL-encoded Unicode as sent from browser
GET /%C3%A4
request.getRequestURI() -> "/%C3%A4"
request.getPathInfo() -> "/ä"
Example 2: path /ä in "binary" Unicode
GET /.. [0xC3,0xA4]
request.getRequestURI() -> "/.." [0xC3,0xA4]
request.getPathInfo() -> "/ä"
So here we can see that getRequestURI() returns the path completely
undecoded, i.e. it applies neither URL decoding nor character decoding. In
example 1 this is what I expected, but in example 2 the result is
that getRequestURI() returns a String containing undecoded binary.
I would expect a String to have been converted to the appropriate
character set, otherwise the method should return a byte[].
Internally Tomcat deals with both these examples, as we can see that
getPathInfo() always returns the correctly decoded path, so I guess
this issue is all about how to interpret the servlet specification.
The servlet 3.0 pdf doesn't give any details on the getRequestURI()
method, so the only clue I can find is the getRequestURI() javadoc
text:
"The web container does not decode this String."
but the examples given in the javadoc only illustrate the removal of
the query string and don't go into any kind of encoding.
So the question is if the javadoc "does not decode" text:
- only applies to URL-encoding (so non-URL-encoded values should
go through character set decoding)
- or, applies also when only character encoding is used (in which
case I think the specification has a bug, as getRequestURI()
then should return byte[])
?
[Naturally, not doing URL-decoding also means that the underlying
character encoding remains untouched. The "bug" here is when only
character encoding is present. For example, this appears in some mod_jk
configurations.]
Hi.
(being in a contest with Mark E. here,)
My 2.5 cent, as someone who is not an expert at Java nor Tomcat per se, but who has spent
an extensive amount of time on the question of dealing with multiple character sets in a
web context.
I believe that your example #2 above is simply illegal.
One is not supposed to send such bytes in a URL without URL-encoding them.
That's per the HTTP RFC itself :
RFC 2616 3.2.2 & 3.2.3
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
-> RFC 2396 part 2. URI Characters and Escape Sequences
(http://www.ietf.org/rfc/rfc2396.txt)
And I believe that the fact that Tomcat is returning the "correct" translation in the
corresponding request.getPathInfo() is purely accidental, and it could be argued that this
is a bug in Tomcat : the request should probably have been rejected, because the requested
URL was invalid.
But it was not rejected, so it filtered further down, and because you did specify that the
URL-encoding was to be seen as UTF-8, something further down the line converted this
2-byte UTF-8 sequence into the appropriate internal representation of the character "ä" in
Java, as seen in your logging of request.getPathInfo().
(See RFC 2616, 5.1.2 Request-URI :
"The Request-URI is transmitted in the format specified in section 3.2.1. If the
Request-URI is encoded using the "% HEX HEX" encoding [42], the origin server MUST decode
the Request-URI in order to properly interpret the request. Servers SHOULD respond to
invalid Request-URIs with an appropriate status code. ")
So if we disregard this invalid URL example #2 (since it is invalid and thus any further
behaviour could be considered as "undefined"), we are left with the general case #1.
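To make the difference between the two examples concrete, here is a minimal sketch of the kind of check a server could apply to the raw bytes of the request URI. It is only an illustration: the class and method names are made up, the rule (visible US-ASCII only) is a simplification of the RFC 2396 character rules, and it is not how Tomcat actually parses requests.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tomcat or the Servlet API.
class RequestUriCheck {

    // Returns true only if every byte is a visible US-ASCII character (0x21..0x7E).
    // Assumption: this is a simplified stand-in for the RFC 2396 rules.
    static boolean isPlainAscii(byte[] rawUri) {
        if (rawUri.length == 0) {
            return false;
        }
        for (byte b : rawUri) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v < 0x21 || v > 0x7E) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Example 1: the path is percent-encoded, so only ASCII bytes are on the wire.
        byte[] example1 = "/%C3%A4".getBytes(StandardCharsets.US_ASCII);
        // Example 2: the two UTF-8 bytes of "ä" are sent unescaped.
        byte[] example2 = new byte[] {0x2F, (byte) 0xC3, (byte) 0xA4};

        System.out.println("example 1 acceptable: " + isPlainAscii(example1));
        System.out.println("example 2 acceptable: " + isPlainAscii(example2));
    }
}
```

With such a check, example 1 passes and example 2 would be a candidate for rejection with an error status.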
The RFCs 2616 and 2396 do not mandate any specific character set/encoding for
the request.
The only thing that they say, is that if the request contains bytes other than the ones
considered as "reserved" or "safe", they should be "URL-encoded" prior to transmission by
the client to the server; and that the first thing that the server should do on reception,
is to "URL-decode" them and restore the original bytes representation, as the client meant
to send them.
And here is one area where the specs are failing : there is no way, in the HTTP protocol,
for the client to indicate to the server what the original character set/encoding of the
URL is; so how can the server know ?
My own interpretation would be as follows :
- in the absence of any other information, the URL after URL-decoding should be viewed as
being in the ISO-8859-1 encoding, as this is the "default character set/encoding" for HTTP
(1.1) in general.
- and any other interpretation depends on a prior agreement between client and
server.
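The two levels can be sketched in a few lines of Java. This is only an illustration of the principle, with made-up class and method names, and it assumes that everything outside the "% HEX HEX" escapes is plain ASCII. Level 1 restores the bytes; level 2 turns bytes into characters, and that second step needs a character set which the request itself does not carry.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the decoder Tomcat uses internally.
class TwoLevelDecoding {

    // Level 1: undo the "% HEX HEX" escapes and return the raw bytes.
    // Assumption: unescaped characters are ASCII; malformed hex digits
    // make Integer.parseInt throw NumberFormatException.
    static byte[] percentDecode(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    // Level 2: interpret the bytes as characters in an agreed character set.
    static String decode(String escaped, Charset agreedCharset) {
        return new String(percentDecode(escaped), agreedCharset);
    }

    public static void main(String[] args) {
        String asUtf8 = decode("/%C3%A4", StandardCharsets.UTF_8);
        String asLatin1 = decode("/%C3%A4", StandardCharsets.ISO_8859_1);
        // Print lengths rather than the strings, to stay independent of the console encoding.
        System.out.println("UTF-8 length: " + asUtf8.length());
        System.out.println("ISO-8859-1 length: " + asLatin1.length());
    }
}
```

The same request yields "/ä" when read as UTF-8 and "/Ã¤" when read as ISO-8859-1, which is exactly why the two sides need to agree beforehand.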
And the URIEncoding attribute of the Tomcat Connector can be considered as such a prior
client-server agreement, like : "in all the applications accessed through this Connector,
the client and the server agree beforehand that any URLs requested by the client will be
Unicode, UTF-8 encoded".
In other words, if your application can guarantee that any request URL sent by one of its
clients will be UTF-8 encoded, /then/ you can use the URIEncoding="UTF-8" attribute in
Tomcat. And only then.
(because e.g. if one of the client users /types/ a URL in the URL bar of his browser, and
this URL happens to target your Tomcat application, you can never be sure that the URL
will be UTF-8 encoded when the browser sends it, because that depends on the settings in
the browser)
The URIEncoding attribute is something which Tomcat adds, outside the HTTP specification
(and even outside the Servlet Spec, AFAIK), to make life easier for the Tomcat application
programmers. Tomcat webapps are written in Java; the internal character set of Java is
Unicode; and it is likely, on a Tomcat host, that all static and JSP pages will be saved
as UTF-8 encoded. It is therefore easier to allow the programmer to just "assume" that when
he uses request.getPathInfo() (or similar calls like request.getParameter()), he will get
a Java string, properly decoded, if the client sent it that way (which in the general case
it would mostly do).
And then, to get back to the initial question, I would assume that request.getRequestURI()
is really meant as a "low-level" call, which returns the request URI "as is", before /any/
interpretation has taken place (not even the URL-decoding (which should happen first), and
much less any character set decoding (which should happen later)).
While the other calls (like request.getPathInfo()) are higher-level calls, which return
strings which have already been URL-decoded and character-set decoded.
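The same low-level/high-level split exists in the plain JDK, which makes it easy to try outside a servlet container. The sketch below uses java.net.URI as an analogy only; it is not the Servlet API. The host and port are made up, and java.net.URI always decodes escaped octets as UTF-8, whereas in Tomcat that choice comes from URIEncoding.

```java
import java.net.URI;

// Analogy using the JDK's java.net.URI; class and method names are hypothetical.
class RawVersusDecoded {

    // "Low-level" view: the path exactly as written, escapes untouched.
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    // "High-level" view: escapes removed and the bytes decoded as UTF-8.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        String uri = "http://localhost:8080/%C3%A4?x=1"; // made-up host and port
        System.out.println("raw:     " + rawPath(uri));
        System.out.println("decoded: " + decodedPath(uri));
    }
}
```

Here rawPath() plays the role of getRequestURI() and decodedPath() the role of getPathInfo().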
And if you want to see the underlying issues in all their glory, I suggest the following
experiment :
1) in a Linux system's shell window, set your locale to one based on UTF-8. (and make sure
that your "terminal" is also set that way).
Then inside one of your webapp's directories, create a file named "ÄÖÜ.txt" (I am
assuming that you can enter that, considering your examples above), with some text A in
it. After creating the file, do an "ls" and a "cat" to see what you got.
2) change your locale and client settings to one based on ISO-8859-1, and create another
file named "ÄÖÜ.txt", with some different text B content. Do an "ls" and a "cat" again,
to see that you really have 2 files with different names and contents.
3) now use a browser (preferably IE for once), and try to request either one of these
files through Tomcat, by typing your request in the browser's URL bar.
You can play around with the settings of the browser (send URLs as ..), with the
URIEncoding attribute in the Tomcat Connector, and the "locale" under which Tomcat is started.
To vary a bit, you can also try to put the corresponding links in a couple of html pages,
with different encodings for the pages.
For even more fun, you can also create a little webapp which will accept the name of the
desired file as a request parameter, open it and return its content.
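Before running the experiment, it may help to see why steps 1) and 2) give two different files. The sketch below percent-encodes the same file name from its bytes under each encoding. The class and method names are made up, and the set of characters left unescaped is a simplification, not the full RFC list.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for the experiment; not part of any library.
class FileNameBytes {

    // Percent-encode the bytes of a name in the given character set.
    // Assumption: only ASCII letters, digits and "-._~" are left unescaped.
    static String percentEncode(String name, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(cs)) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String name = "\u00C4\u00D6\u00DC.txt"; // "ÄÖÜ.txt"
        System.out.println(percentEncode(name, StandardCharsets.UTF_8));
        System.out.println(percentEncode(name, StandardCharsets.ISO_8859_1));
    }
}
```

The UTF-8 name comes out as %C3%84%C3%96%C3%9C.txt and the ISO-8859-1 name as %C4%D6%DC.txt, so the browser can send two different requests for what looks like the same name on screen.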
It is only to English-speaking Java programmers writing English-speaking applications that
the matter may appear simple and settled.