Mike Wilson wrote:
Hi Chris,

I'm aware of the two levels of encoding but I'm wondering whether servlet specification writers were :-)
Here are two examples from Tomcat 7 running with URIEncoding="UTF-8".

Example 1: path /ä in URL-encoded Unicode as sent from browser
  GET /%C3%A4
  request.getRequestURI() -> "/%C3%A4"
  request.getPathInfo()   -> "/ä"

Example 2: path /ä in "binary" Unicode
  GET /.. [0xC3,0xA4]
  request.getRequestURI() -> "/.." [0xC3,0xA4]
  request.getPathInfo()   -> "/ä"

So here we can see that getRequestURI() returns the path completely
undecoded, i.e. it applies neither URL decoding nor character decoding. In
example 1 this is what I expected, but in example 2 the result is
that getRequestURI() returns a String containing undecoded binary.
I would expect a String to have been converted to the appropriate
character set, otherwise the method should return a byte[].

Internally Tomcat deals with both these examples, as we can see
getPathInfo() always returns the correctly decoded path, so I guess this
issue is all about how to interpret the servlet specification. The
servlet 3.0 pdf doesn't give any details on the getRequestURI() method,
so the only clue I can find is the getRequestURI() javadoc text:
  "The web container does not decode this String."
but the examples given in the javadoc only illustrate the removal of
the query string and don't go into any kind of encoding.

So the question is whether the javadoc "does not decode" text:
- only applies to URL-encoding (so non-URL-encoded values should
  go through character set decoding)
- or also applies when only character encoding is used (in which
  case I think the specification has a bug, as getRequestURI() then
  should return byte[])
?

[Naturally, not doing URL-decoding also means that the underlying
character encoding remains untouched. The "bug" here is when only
character encoding is present. For example, this appears in some mod_jk
configurations.]


Hi.
(being in a contest with Mark E. here,)
My 2.5 cents, as someone who is not an expert at Java nor Tomcat per se, but who has spent an extensive amount of time on the question of dealing with multiple character sets in a web context.

I believe that your example #2 above is simply illegal.
One is not supposed to send such bytes in a URL without URL-encoding them.
That's per the HTTP RFC itself :
RFC 2616 3.2.2 & 3.2.3 
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
-> RFC 2396 part 2. URI Characters and Escape Sequences
(http://www.ietf.org/rfc/rfc2396.txt)

And I believe that the fact that Tomcat is returning the "correct" translation in the corresponding request.getPathInfo() is purely accidental, and it could be argued that this is a bug in Tomcat : the request should probably have been rejected, because the requested URL was invalid. But it was not rejected, so it filtered further down, and because you did specify that the URL-encoding was to be seen as UTF-8, something further down the line converted this 2-byte UTF-8 sequence into the appropriate internal representation of the character "ä" in Java, as seen in your logging of request.getPathInfo().

(See RFC 2616, 5.1.2 Request-URI :
"The Request-URI is transmitted in the format specified in section 3.2.1. If the Request-URI is encoded using the "% HEX HEX" encoding [42], the origin server MUST decode the Request-URI in order to properly interpret the request. Servers SHOULD respond to invalid Request-URIs with an appropriate status code. ")

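To make the "example #2 is illegal" point concrete, here is a minimal sketch of the kind of check a server could apply to the raw request target before doing anything else. The class and method names are made up for illustration, and the test is deliberately cruder than the full RFC 2396 grammar: it only verifies that every byte is a printable, non-space US-ASCII character, which is already enough to tell example 1 from example 2.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tomcat or the Servlet API.
final class RequestTargetCheck {

    /**
     * Returns true if every byte is printable, non-space US-ASCII.
     * Simplified assumption: a real check would follow the URI grammar
     * and also reject characters such as '<', '>' or '"'.
     */
    static boolean isPlainAscii(byte[] rawTarget) {
        for (byte b : rawTarget) {
            int v = b & 0xFF;               // treat the byte as unsigned
            if (v <= 0x20 || v >= 0x7F) {   // control, space, DEL or 8-bit
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Example 1: URL-encoded, only ASCII on the wire.
        byte[] example1 = "/%C3%A4".getBytes(StandardCharsets.US_ASCII);
        // Example 2: raw UTF-8 bytes for U+00E4 sent without escaping.
        byte[] example2 = { '/', (byte) 0xC3, (byte) 0xA4 };

        System.out.println(isPlainAscii(example1)); // true
        System.out.println(isPlainAscii(example2)); // false
    }
}
```

A server that wanted to be strict could answer a request failing such a check with a 400 status instead of passing it further down.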

So if we disregard this invalid URL example #2 (since it is invalid and thus any further behaviour could be considered as "undefined"), we are left with the general case #1.

The RFCs 2616 and 2396 do not mandate any specific character set/encoding for the request.
The only thing that they say, is that if the request contains bytes other than the ones considered as "reserved" or "safe", they should be "URL-encoded" prior to transmission by the client to the server; and that the first thing that the server should do on reception, is to "URL-decode" them and restore the original bytes representation, as the client meant to send them.

And here is one area where the specs are failing : there is no way, in the HTTP protocol, for the client to indicate to the server what the original character set/encoding of the URL is; so how can the server know ?
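The ambiguity can be shown with a few lines of Java. This is only a sketch under assumptions: the helper name is hypothetical, and the percent-encoder is hand-written and simplified (it leaves unreserved ASCII and "/" alone and escapes everything else). It shows the same path "/ä" leaving two clients as two different byte sequences, with nothing in the request saying which encoding was used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not an API of Tomcat or any browser.
final class ClientEncodingAmbiguity {

    /** Encodes s in the given charset, then %-escapes every byte that is not plain path ASCII. */
    static String percentEncode(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            int v = b & 0xFF;
            boolean keep = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~' || v == '/';
            if (keep) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String path = "/\u00e4"; // "/" followed by U+00E4 (a with diaeresis)

        // A client working in ISO-8859-1 sends one escaped byte.
        System.out.println(percentEncode(path, StandardCharsets.ISO_8859_1)); // /%E4
        // A client working in UTF-8 sends two escaped bytes.
        System.out.println(percentEncode(path, StandardCharsets.UTF_8));      // /%C3%A4

        // A server that assumes UTF-8 cannot make sense of the single byte 0xE4:
        // Java substitutes the replacement character U+FFFD.
        String misread = new String(new byte[] { (byte) 0xE4 }, StandardCharsets.UTF_8);
        System.out.println(misread.indexOf('\uFFFD') >= 0); // true
    }
}
```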

My own interpretation would be as follows :
- in the absence of any other information, the URL after URL-decoding should be viewed as being in the ISO-8859-1 encoding, as this is the "default character set/encoding" for HTTP (1.1) in general.
- and any other interpretation depends on a prior agreement between client and server.

And the URIEncoding attribute of the Tomcat Connector can be considered as such a prior client-server agreement, like : "in all the applications accessed through this Connector, the client and the server agree beforehand that any URLs requested by the client will be Unicode, UTF-8 encoded".

In other words, if your application can guarantee that any request URL sent by one of its clients will be UTF-8 encoded, /then/ you can use the URIEncoding="UTF-8" attribute in Tomcat. And only then. (because e.g. if one of the client users /types/ a URL in the URL bar of his browser, and this URL happens to target your Tomcat application, you can never be sure that the URL will be UTF-8 encoded when the browser sends it, because that depends on the settings in the browser)

The URIEncoding attribute is something which Tomcat adds, outside the HTTP specification (and even outside the Servlet Spec, AFAIK), to make life easier for the Tomcat application programmers : because Tomcat webapps are written in Java; because the internal character set of Java is Unicode; and because it is likely, on a Tomcat host, that all static and JSP pages will be saved as UTF-8 encoded, it is easier to allow the programmer to just "assume" that when he uses request.getPathInfo() (or similar calls like request.getParameter()), he will get a Java string, properly decoded, if the client sent it that way (which in the general case it would mostly do).

And then, to get back to the initial question, I would assume that request.getRequestURI() is really meant as a "low-level" call, which returns the request URI "as is", before /any/ interpretation has taken place (not even the URL-decoding (which should happen first), and much less any character set decoding (which should happen later)). While the other calls (like request.getPathInfo()) are higher-level calls, which return strings which have already been URL-decoded and character-set decoded.
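The two stages can be written out as a small standalone Java sketch. It does not use the Servlet API and is not how Tomcat implements it; the class and method names are invented for illustration. Stage 1 turns the "as is" URI into bytes, stage 2 turns bytes into characters, and only stage 2 needs to know the character set.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the two decoding levels.
final class TwoStageDecoding {

    /** Stage 1: undo "% HEX HEX" escapes, giving back the bytes the client meant to send. */
    static byte[] percentDecode(String rawPath) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < rawPath.length(); i++) {
            char c = rawPath.charAt(i);
            if (c == '%') {
                if (i + 2 >= rawPath.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = Character.digit(rawPath.charAt(i + 1), 16);
                int lo = Character.digit(rawPath.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c); // assumes the rest of the raw URI is plain ASCII
            }
        }
        return out.toByteArray();
    }

    /** Stage 2: interpret the bytes in an agreed character set. */
    static String decode(String rawPath, Charset agreedCharset) {
        return new String(percentDecode(rawPath), agreedCharset);
    }

    public static void main(String[] args) {
        String raw = "/%C3%A4"; // what a getRequestURI()-style call would hand back

        // With a UTF-8 agreement the result is "/" plus U+00E4.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
        // With the ISO-8859-1 default the same bytes become "/" plus U+00C3 U+00A4.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

The same raw URI gives two different Java strings depending only on the charset passed to stage 2, which is the role the URIEncoding attribute plays.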


And if you want to see the underlying issues in all their glory, I suggest the following experiment :
1) in a Linux system's shell window, set your locale to one based on UTF-8 (and make sure that your "terminal" is also set that way). Then inside one of your webapp's directories, create a file named "ÄÖÜ.txt" (I am assuming that you can enter that, considering your examples above), with some text A in it. After creating the file, do an "ls" and a "cat" to see what you got.
2) change your locale and client settings to one based on ISO-8859-1, and create another file named "ÄÖÜ.txt", with some different text B content. Do an "ls" and a "cat" again, to see that you really have 2 files with different names and contents.
3) now use a browser (preferably IE for once), and try to request either one of these files through Tomcat, by typing your request in the browser's URL bar.
You can play around with the settings of the browser (send URLs as ..), with the URIEncoding attribute in the Tomcat Connector, and the "locale" under which Tomcat is started. To vary a bit, you can also try to put the corresponding links in a couple of html pages, with different encodings for the pages. For even more fun, you can also create a little webapp which will accept the name of the desired file as a request parameter, open it and return its content.
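Why the experiment ends up with two distinct files can be seen without touching a file system. The sketch below (class and method names are hypothetical) prints the bytes that the name "ÄÖÜ.txt" turns into under each of the two encodings; on Linux a file name is stored as bytes, so these are two different names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; it only inspects bytes and creates no files.
final class FileNameBytes {

    /** Returns the bytes of name in the given charset as space-separated hex pairs. */
    static String hex(String name, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00C4 U+00D6 U+00DC followed by ".txt", written with escapes so the
        // source file's own encoding does not matter.
        String name = "\u00c4\u00d6\u00dc.txt";

        System.out.println(hex(name, StandardCharsets.UTF_8));
        // C3 84 C3 96 C3 9C 2E 74 78 74   (10 bytes)
        System.out.println(hex(name, StandardCharsets.ISO_8859_1));
        // C4 D6 DC 2E 74 78 74            (7 bytes)
    }
}
```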

It is only to English-speaking Java programmers writing English-speaking applications that the matter may appear simple and settled.


