Dear list,

in https://bz.apache.org/bugzilla/show_bug.cgi?id=59317 HttpServletRequest.getRequestURI() was changed for Tomcat 7.0.70 onwards to always return an encoded URI, which matches the Servlet 3.0 specification. However, the encoding of the path component of the URL seems to be incorrect, so I wanted to raise the issue on the mailing list before opening a bug ticket. I could not find any related ticket in the bug tracker, nor any newer discussion in the mailing list archives since the problem was fixed.
My apologies if this mail is a bit lengthy, but please bear with me as I want to provide a thorough problem description.

The dispatchersUseEncodedPaths context attribute was introduced in ticket #59317 to allow reverting to the "old" behavior. This is still broken, as "+" characters in dispatched URIs are altered regardless of whether the value is set to "true" or "false". (That is, the + is not kept literally.)

Please note that "+" is a perfectly valid character in the path component of a URL and has no special meaning there (unlike in a query string such as ?foo=bar+baz, where it stands for a space). For instance, Google uses it as a literal character: https://plus.google.com/+Google works, while https://plus.google.com/%20Google returns a 404. I will return to the details of whether + is a valid character in a URL path further below.

= Problem statement =

We are using + characters in URLs like https://www.example.com/myservlet/url+with+spaces/sub+url.html, which is handled by an HttpServlet. For each of these URLs there is also a prefixed version for partners, e.g. https://example.com/prefix/myservlet/url+with+spaces/sub+url.html

If such a prefix is encountered, it is removed by a servlet Filter and the request is dispatched to the URL without the prefix, e.g. /prefix/myservlet/url+with+spaces/sub+url.html is dispatched to /myservlet/url+with+spaces/sub+url.html, which in turn is handled by the HttpServlet.
(That is, in a Filter: request.getRequestDispatcher("/myservlet/url+with+spaces/sub+url.html").forward(request, response);)

When calling HttpServletRequest.getRequestURI() in the Servlet, the return values are as follows.

For Tomcat <= 7.0.69:

  Calling the URL directly:
    /myservlet/url+with+spaces/sub+url.html
  Calling the URL with a prefix:
    /myservlet/url+with+spaces/sub+url.html

Since 7.0.70 the return value of request.getRequestURI() in the Servlet is inconsistent:

  Calling the URL directly:
    /myservlet/url+with+spaces/sub+url.html

Calling the prefixed URL now depends on the value of dispatchersUseEncodedPaths:

  Calling the prefixed URL with "false" (note the %2B instead of +):
    /myservlet/url%2Bwith%2Bspaces/sub%2Burl.html
  Calling the prefixed URL with "true" (+ is replaced by %20):
    /myservlet/url%20with%20spaces/sub%20url.html

In either case the value does not match what is returned when the URL is called directly, and worse, with the default behavior the result is not equivalent to the original URL. The expected behavior is that the "+" is not encoded at all, rather than being encoded as %2B (for "false") or replaced by %20 (for "true").

The reason is that in the Catalina URLEncoder.DEFAULT at https://github.com/apache/tomcat/blob/trunk/java/org/apache/catalina/util/URLEncoder.java "+" is not in the list of safe characters. As URLEncoder.DEFAULT is used in all places of the changeset for bug ticket #59317 mentioned at the beginning of this mail, "+" characters will always be encoded. See https://github.com/apache/tomcat/commit/eb195bebac8239b994fa921aeedb136a93e4ccaf#diff-8b91a9296e19012bf6be4bdf975fab0d for details.

= On the validity of "+" in URLs =

An HTTP URL typically consists of a protocol, host, path and query. Let's focus on the last two: for /foo+bar?baz=a+b the path is /foo+bar and the query is baz=a+b. While the + character has a special meaning (a space) in the query string, this is not the case for the path, where it is just a regular character.
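To illustrate the path/query difference with plain JDK classes, here is a minimal sketch. It does not use or imitate Tomcat's code; the class name PlusInPathDemo and the helper names decodedPath and formDecode are hypothetical and exist only for this example. It relies on java.net.URI decoding only percent escapes in the path, and on java.net.URLDecoder implementing form decoding, where "+" becomes a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo class: shows how '+' is treated in a path vs. a form-encoded query value.
class PlusInPathDemo {

    // Returns the decoded path component. java.net.URI only decodes %XX escapes,
    // so a literal '+' in the path stays a '+'.
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    // Decodes a value as application/x-www-form-urlencoded, where '+' means a space.
    static String formDecode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = URI.create("/foo+bar?baz=a+b");

        System.out.println(uri.getPath());                 // /foo+bar
        System.out.println(uri.getRawQuery());             // baz=a+b
        System.out.println(formDecode("a+b"));             // a b

        // The two results of the dispatched request, interpreted as paths:
        System.out.println(decodedPath("/foo%2Bbar"));     // /foo+bar
        System.out.println(decodedPath("/foo%20bar"));     // /foo bar
    }
}
```

Read this way, %2B in a path decodes back to "+", while %20 decodes to a space, which is a different path than the one originally requested.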
Although the encodings of path and query string are somewhat similar, they are NOT the same! The query is specified as application/x-www-form-urlencoded, but the path is not.

= See also =

Question on Stack Overflow:
  stackoverflow.com/questions/1005676/urls-and-plus-signs

Blog post listing valid characters in URI components, see section "The reserved characters are different for each part":
  https://web-beta.archive.org/web/20150509184317/http://blog.lunatech.com:80/2009/02/03/what-every-web-developer-must-know-about-url-encoding

Relevant RFCs:
  https://tools.ietf.org/html/rfc3986#section-2.2
  https://tools.ietf.org/html/rfc3986#section-2.3
Note that the set of reserved characters is different for each scheme and URI component, as also stated in the blog post above.

Definition of the HTTP URI scheme in RFC 7230, sections 2.7.1 and 2.7.3 (p. 17ff):
  https://tools.ietf.org/html/rfc7230

To my knowledge there is no place in the above RFCs stating that a + must be encoded in the path component of a URI or that it has a special meaning there (unlike in query strings).

Follow-up discussion after #59317 was fixed:
  http://marc.info/?l=tomcat-user&m=146800805502015

= How do other servlet containers handle this? =

For Jetty I found the following issue:
  https://bugs.eclipse.org/bugs/show_bug.cgi?id=435017

= Reproducing the problem =

I created the following Gist (to keep this mail shorter):
  https://gist.github.com/tburny/468e635c176752f21251fc641450594d

I ran this with Tomcat 7.0.69 and 7.0.77, but I would assume that all versions affected by #59317 also show the behavior I described.

My question is whether this behavior is intended or whether it is a bug.

As I'm a native German speaker, I apologize for any grammar mistakes or misspellings. Thank you for your efforts and patience while reading this mail.
Kind regards,
Tobias Brennecke