Re: getRequestURI() in relation to Connector.URIEncoding

André Warnier Sun, 17 Feb 2013 08:54:50 -0800

Mike Wilson wrote:

Hi Chris,

I'm aware of the two levels of encoding, but I'm wondering whether the
servlet specification writers were :-)

Here are two examples from Tomcat 7 running with URIEncoding="UTF-8".

Example 1: path /ä in URL-encoded Unicode as sent from browser
  GET /%C3%A4
  request.getRequestURI() -> "/%C3%A4"
  request.getPathInfo()   -> "/ä"

Example 2: path /ä in "binary" Unicode
  GET /.. [0xC3,0xA4]
  request.getRequestURI() -> "/.." [0xC3,0xA4]
  request.getPathInfo()   -> "/ä"

So here we can see that getRequestURI() returns the path completely
undecoded, ie doesn't apply URL decoding nor character decoding. In
example 1 this is what I expected, but in example 2 the result is
that getRequestURI() returns a String containing undecoded binary.
I would expect a String to have been converted to the appropriate
character set, otherwise the method should return a byte[].

Internally Tomcat deals with both of these examples, as we can see
that getPathInfo() always returns the correctly decoded path, so I
guess this issue is all about how to interpret the servlet
specification. The servlet 3.0 pdf doesn't give any details on the
getRequestURI() method, so the only clue I can find is the
getRequestURI() javadoc text:

  "The web container does not decode this String."
but the examples given in the javadoc only illustrate the removal of
the query string and don't go into any kind of encoding.

So the question is if the javadoc "does not decode" text:
- only applies to URL-encoding (so non-URL-encoded values should
  go through character set decoding)

- or, applies also when only character encoding is used (in which
  case I think the specification has a bug, as getRequestURI()
  then should return byte[])

?

[Naturally, not doing URL-decoding also means that the underlying
character encoding remains untouched. The "bug" here is when only
character encoding is present. F ex, this appears in some mod_jk
configurations.]


Hi.
(being in a contest with Mark E. here,)

My 2.5 cents, as someone who is not an expert at Java nor Tomcat per se, but who has spent
an extensive amount of time on the question of dealing with multiple character sets in a
web context.


I believe that your example #2 above is simply illegal.
One is not supposed to send such bytes in a URL without URL-encoding them.
That's per the HTTP RFC itself :
RFC 2616 3.2.2 & 3.2.3 
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
-> RFC 2396 part 2. URI Characters and Escape Sequences
(http://www.ietf.org/rfc/rfc2396.txt)

And I believe that the fact that Tomcat is returning the "correct" translation in the
corresponding request.getPathInfo() is purely accidental, and it could be argued that this
is a bug in Tomcat : the request should probably have been rejected, because the requested
URL was invalid.
But it was not rejected, so it filtered further down, and because you did specify that the
URL-encoding was to be seen as UTF-8, something further down the line converted this
2-byte UTF-8 sequence into the appropriate internal representation of the character "ä" in
Java, as seen in your logging of request.getPathInfo().


(See RFC 2616, 5.1.2 Request-URI :

"The Request-URI is transmitted in the format specified in section 3.2.1. If the
Request-URI is encoded using the "% HEX HEX" encoding [42], the origin server MUST decode
the Request-URI in order to properly interpret the request. Servers SHOULD respond to
invalid Request-URIs with an appropriate status code.")

So if we disregard this invalid URL example #2 (since it is invalid and thus any further
behaviour could be considered as "undefined"), we are left with the general case #1.
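To make the difference between the two examples concrete, here is a minimal sketch of a
check that tells them apart. The class and method names are hypothetical, and the test is
deliberately simplified: it only flags control characters, spaces and anything outside
ASCII, whereas RFC 2396 excludes a few more characters. It is not how Tomcat validates
requests.

```java
public class RequestUriCheck {

    // Hypothetical helper: returns true if the raw request target contains a
    // character that should have been percent-encoded by the client.
    // Simplification: only control characters, space and non-ASCII are flagged.
    public static boolean needsEscaping(String rawUri) {
        for (int i = 0; i < rawUri.length(); i++) {
            char c = rawUri.charAt(i);
            if (c <= 0x20 || c >= 0x7F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Example 1: the path is percent-encoded, so it is pure ASCII.
        System.out.println(needsEscaping("/%C3%A4"));       // prints "false"
        // Example 2: the two raw bytes 0xC3 0xA4, seen here as two chars.
        System.out.println(needsEscaping("/\u00C3\u00A4")); // prints "true"
    }
}
```

A server that wanted to be strict could answer requests of the second kind with a 400
status instead of passing them further down.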


The RFCs 2616 and 2396 do not mandate any specific character set/encoding for 
the request.

The only thing that they say, is that if the request contains bytes other than the ones
considered as "reserved" or "safe", they should be "URL-encoded" prior to transmission by
the client to the server; and that the first thing that the server should do on reception,
is to "URL-decode" them and restore the original bytes representation, as the client meant
to send them.
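As an illustration of that first step only, the following sketch turns a percent-encoded
string back into the bytes the client meant to send. PercentDecoder and percentDecode are
names made up for this example, not part of the Servlet API or of Tomcat. Note that the
result is a byte[], not a String : no character set has been involved yet.

```java
import java.io.ByteArrayOutputStream;

public class PercentDecoder {

    // Converts one hexadecimal digit to its numeric value.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("not a hex digit: " + c);
    }

    // Replaces each "%XX" by the byte it stands for and copies every other
    // character as its ASCII byte. Assumes the input is pure ASCII.
    public static byte[] percentDecode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("non-ASCII character at index " + i);
            }
            if (c == '%') {
                if (i + 2 >= raw.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = hexValue(raw.charAt(i + 1));
                int lo = hexValue(raw.charAt(i + 2));
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : percentDecode("/%C3%A4")) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "2F C3 A4"
    }
}
```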

And here is one area where the specs are failing : there is no way, in the HTTP protocol,
for the client to indicate to the server what the original character set/encoding of the
URL is; so how can the server know ?


My own interpretation would be as follows :

- in the absence of any other information, the URL after URL-decoding should be viewed as
being in the ISO-8859-1 encoding, as this is the "default character set/encoding" for HTTP
(1.1) in general.

- and any other interpretation depends on a prior agreement between client and 
server.
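Why such an agreement matters can be shown with a few lines of Java. This is only a
sketch, with a hypothetical class name : the same three bytes, taken from example #1 after
URL-decoding, give two different strings depending on which character set the server
assumes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetAmbiguity {

    // Second decoding step: interpret already URL-decoded bytes as characters.
    public static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The bytes of "/%C3%A4" after URL-decoding.
        byte[] bytes = {0x2F, (byte) 0xC3, (byte) 0xA4};

        String asUtf8 = decodeAs(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decodeAs(bytes, StandardCharsets.ISO_8859_1);

        // Under UTF-8 the two bytes form one character, U+00E4.
        System.out.println(asUtf8.equals("/\u00E4"));         // prints "true"
        // Under ISO-8859-1 each byte is its own character: U+00C3, U+00A4.
        System.out.println(asLatin1.equals("/\u00C3\u00A4")); // prints "true"
    }
}
```

Neither result is "wrong" as far as the bytes are concerned; only the prior agreement
tells the server which one the client meant.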

And the URIEncoding attribute of the Tomcat Connector can be considered as such a prior
client-server agreement, like : "in all the applications accessed through this Connector,
the client and the server agree beforehand that any URLs requested by the client will be
Unicode, UTF-8 encoded".

In other words, if your application can guarantee that any request URL sent by one of its
clients will be UTF-8 encoded, /then/ you can use the URIEncoding="UTF-8" attribute in
Tomcat. And only then.
(because e.g. if one of the client users /types/ a URL in the URL bar of his browser, and
this URL happens to target your Tomcat application, you can never be sure that the URL
will be UTF-8 encoded when the browser sends it, because that depends on the settings in
the browser)

The URIEncoding attribute is something which Tomcat adds, outside the HTTP specification
(and even outside the Servlet Spec, AFAIK), to make life easier for the Tomcat application
programmers : because Tomcat webapps are written in Java; because the internal character
set of Java is Unicode; and because it is likely, on a Tomcat host, that all static and
JSP pages will be saved as UTF-8 encoded, therefore it is easier to allow the programmer
to just "assume" that when he uses request.getPathInfo() (or similar calls like
request.getParameter()), he will get a Java string, properly decoded, if the client sent
it that way (which in the general case it would mostly do).

And then, to get back to the initial question, I would assume that request.getRequestURI()
is really meant as a "low-level" call, which returns the request URI "as is", before /any/
interpretation has taken place (not even the URL-decoding (which should happen first), and
much less any character set decoding (which should happen later)).
While the other calls (like request.getPathInfo()) are higher-level calls, which return
strings which have already been URL-decoded and character-set decoded.
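The JDK itself has the same split between a "raw" and a "decoded" accessor, which may help
as an analogy. The sketch below uses java.net.URI, not the Servlet API, and the class and
helper names are hypothetical. One difference to keep in mind : java.net.URI always
decodes escaped bytes as UTF-8, while Tomcat uses whatever URIEncoding says.

```java
import java.net.URI;

public class RawVersusDecoded {

    // Analogous in spirit to getRequestURI(): the path exactly as written.
    public static String rawPath(String requestTarget) {
        return URI.create(requestTarget).getRawPath();
    }

    // Analogous in spirit to getPathInfo(): URL-decoded, then decoded as UTF-8.
    public static String decodedPath(String requestTarget) {
        return URI.create(requestTarget).getPath();
    }

    public static void main(String[] args) {
        String target = "/%C3%A4";
        System.out.println(rawPath(target));                      // prints "/%C3%A4"
        System.out.println(decodedPath(target).equals("/\u00E4")); // prints "true"
    }
}
```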

And if you want to see the underlying issues in all their glory, I suggest the following
experiment :
1) in a Linux system's shell window, set your locale to one based on UTF-8. (and make sure
that your "terminal" is also set that way).
Then inside one of your webapp's directories, create a file named "ÄÖÜ.txt" (I am
assuming that you can enter that, considering your examples above), with some text A in
it. After creating the file, do an "ls" and a "cat" to see what you got.
2) change your locale and client settings to one based on ISO-8859-1, and create another
file named "ÄÖÜ.txt", with some different text B content. Do an "ls" and a "cat" again,
to see that you really have 2 files with different names and contents.
3) now use a browser (preferably IE for once), and try to request either one of these
files through Tomcat, by typing your request in the browser's URL bar.
You can play around with the settings of the browser (send URLs as ..), with the
URIEncoding attribute in the Tomcat Connector, and the "locale" under which Tomcat is
started.
To vary a bit, you can also try to put the corresponding links in a couple of html pages,
with different encodings for the pages.
For even more fun, you can also create a little webapp which will accept the name of the
desired file as a request parameter, open it and return its content.
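Before running the experiment, it can help to see on paper why steps 1 and 2 produce two
different files. The following sketch (hypothetical class and method names, and a
conservative choice of which characters are left unescaped) prints the URL-encoded form
of the same file name under both character sets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileNameBytes {

    // Encodes the text in the given charset, then writes every byte that is
    // not an ASCII letter, digit, '-', '.', '_' or '~' as "%XX".
    public static String percentEncode(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u00C4\u00D6\u00DC.txt"; // the file name from the experiment
        // prints "%C3%84%C3%96%C3%9C.txt"
        System.out.println(percentEncode(name, StandardCharsets.UTF_8));
        // prints "%C4%D6%DC.txt"
        System.out.println(percentEncode(name, StandardCharsets.ISO_8859_1));
    }
}
```

The UTF-8 name is 10 bytes long on disk and the ISO-8859-1 name is 7 bytes long, so a
request only finds the file if the browser, Tomcat and the file system all agree on which
of the two byte sequences is meant.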

It is only to English-speaking Java programmers writing English-speaking applications that
the matter may appear simple and settled.



