Hi Chris,

I'm aware of the two levels of encoding, but I'm wondering whether 
the servlet specification writers were :-)
Here are two examples from Tomcat 7 running with URIEncoding="UTF-8".

Example 1: path /ä as URL-encoded (percent-encoded) UTF-8, as sent from a browser
  GET /%C3%A4
  request.getRequestURI() -> "/%C3%A4"
  request.getPathInfo()   -> "/ä"

Example 2: path /ä as raw ("binary") UTF-8 bytes, not URL-encoded
  GET /.. [0xC3,0xA4]
  request.getRequestURI() -> "/.." [0xC3,0xA4]
  request.getPathInfo()   -> "/ä"

So here we can see that getRequestURI() returns the path completely
undecoded, i.e. it applies neither URL decoding nor character decoding.
In example 1 this is what I expected, but in example 2 the result is
that getRequestURI() returns a String containing undecoded binary.
I would expect the contents of a String to have been decoded using the
appropriate character set; otherwise the method should return a byte[].
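
To make the two levels concrete, here is a minimal Java sketch of what I
mean. The class and method names are made up for this mail, and the idea
that each undecoded byte ends up as one char in the String is my
assumption about example 2, not something the specification states.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UriDecodingSketch {

    /**
     * Level 1: URL (percent) decoding. Turns "%C3%A4" into the bytes
     * 0xC3 0xA4 and copies every other char through as a single byte.
     * Assumes every char in rawPath is <= 0xFF (one char per request byte).
     * Not hardened against malformed input.
     */
    public static byte[] percentDecode(String rawPath) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < rawPath.length(); i++) {
            char c = rawPath.charAt(i);
            if (c == '%' && i + 2 < rawPath.length()) {
                // Throws NumberFormatException on escapes such as "%zz".
                out.write(Integer.parseInt(rawPath.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c); // write(int) keeps only the low 8 bits
            }
        }
        return out.toByteArray();
    }

    /** Level 2: character decoding of the resulting bytes, here as UTF-8. */
    public static String decodePath(String rawPath) {
        return new String(percentDecode(rawPath), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Example 1: percent-encoded UTF-8.
        String fromExample1 = decodePath("/%C3%A4");
        // Example 2: the raw bytes 0xC3 0xA4 held as two chars (assumption).
        String fromExample2 = decodePath("/\u00C3\u00A4");
        // Both should be "/" followed by U+00E4.
        System.out.println(fromExample1.equals("/\u00E4"));
        System.out.println(fromExample2.equals(fromExample1));
    }
}
```

I avoided java.net.URLDecoder on purpose, since it also turns "+" into a
space, which is form decoding rather than path decoding.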

Internally Tomcat handles both of these examples, as we can see that
getPathInfo() always returns the correctly decoded path, so I guess 
this issue is all about how to interpret the servlet specification. 

The Servlet 3.0 PDF doesn't give any details on the getRequestURI() 
method, so the only clue I can find is the getRequestURI() javadoc 
text:
  "The web container does not decode this String."
but the examples given in the javadoc only illustrate the removal of
the query string and don't go into any kind of encoding.

So the question is whether the javadoc "does not decode" text:
- only applies to URL encoding (so values that are not URL-encoded
  should still go through character set decoding)
- or also applies when only character encoding is used (in which 
  case I think the specification has a bug, as getRequestURI() 
  should then return byte[])
?

[Naturally, not doing URL decoding also means that the underlying
character encoding remains untouched. The "bug" here is when only
character encoding is present. For example, this appears in some mod_jk
configurations.]

Best regards
Mike

Christopher Schultz wrote:
> Mike,
> 
> On 2/14/13 9:51 AM, Mike Wilson wrote:
> > I can see that even if you specify URIEncoding=UTF-8 in
> > server.xml, calls to HttpServletRequest.getRequestURI() will still
> > return an undecoded String. (This is probably because of the
> > "specification text" in javadoc: "The web container does not decode
> > this String.")
> > 
> > My question is if this behaviour has changed throughout Tomcat 
> > versions?
> > 
> > We got problems with this when upgrading to Tomcat 7, and it seems 
> > we have been getting decoded strings previously when we were using 
> > JBoss 4 (based on Tomcat 5.5 IIRC).
> 
> I think you may be confusing character encoding versus URL encoding.
> The <Connector>'s URIEncoding is a character encoding (e.g.
> ISO-8859-1, UTF-8, etc.) that will be used to convert bytes into
> characters while URL encoding is the transformation of characters like
> "+" into spaces, %-decoding, etc.
> 
> What kind of encoding is (or isn't) happening that seems surprising?
> 
> -chris

