I have a question (using Tomcat 9.0.12 on Windows 10), and I'd like someone on the Tomcat development team to clarify a distinction for me regarding resource charsets and octet decoding in a particular format. I am not a newbie, and although the answer to my question may seem obvious, it goes to a critical issue that I believe to be a fundamental bug in Tomcat's encoding processing.

Let's say that as an HTTP client I retrieve a resource `readme.txt` from Tomcat, and Tomcat clearly indicates via the HTTP response headers that the `Content-Type` is `text/plain; charset=ISO-8859-1`. That file contains, among other things, a line that says:

    See https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9 for more info.

I want to parse the text file and present a live link to the user (email clients do this all the time), but I want to make the link "pretty" by decoding the URL. The question is: do I decode the octets using UTF-8, to show `…fullName=Flávio+José`, or do I use ISO-8859-1 to decode the octets, so that I show `…fullName=FlÃ¡vio+JosÃ©`? (Flávio José is a famous Brazilian forró singer and musician, by the way.)

The content type encoding of `readme.txt` is ISO-8859-1, so I must use ISO-8859-1 to decode the octets in `Fl%C3%A1vio+Jos%C3%A9`, yielding `…fullName=FlÃ¡vio+JosÃ©`, right??!

No, of course not. The decoding of the octet sequence is independent of the resource encoding, and represents a separate layer of encoding _on top_ of the resource encoding. It wouldn't matter whether the text file were encoded in UTF-8, ISO-8859-1, or US-ASCII—the URL would still be https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9, and its octets should still be decoded using UTF-8 as per RFC 3986.
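As a minimal sketch of the two decodings being compared, the following Java snippet percent-decodes the same value once as UTF-8 and once as ISO-8859-1. The class name `PercentDecodeDemo` and its `decode` helper are hypothetical names chosen for illustration; only `java.net.URLDecoder` is a real JDK API. Note that `URLDecoder` implements form-style decoding, so it also turns `+` into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PercentDecodeDemo {

    /**
     * Percent-decodes a value, interpreting the decoded octets in the given
     * charset. Wraps the checked exception so callers stay simple.
     */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "Fl%C3%A1vio+Jos%C3%A9";

        // Octets C3 A1 and C3 A9 read as UTF-8: one character each.
        System.out.println(decode(encoded, "UTF-8"));

        // The same octets read as ISO-8859-1: two characters each (mojibake).
        System.out.println(decode(encoded, "ISO-8859-1"));
    }
}
```

Decoded as UTF-8 the value is `Flávio José`; decoded as ISO-8859-1 it becomes `FlÃ¡vio JosÃ©`, because each two-octet UTF-8 sequence is read as two separate Latin-1 characters.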

I'll get right to the point; the above was a rhetorical question used as an analogy.

The Tomcat FAQ at https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q9 indicates that the default encoding for an HTTP POST is ISO-8859-1. That is true. However, Tomcat then goes further and assumes that it should decode _the octets of `application/x-www-form-urlencoded`_ using ISO-8859-1 as well! This is simply wrong; the octets should be interpreted as a sequence of UTF-8 octets; see https://url.spec.whatwg.org/#concept-urlencoded-serializer . This means that if my browser sends a POST with content `fullName=Fl%C3%A1vio+Jos%C3%A9` using `application/x-www-form-urlencoded`, Tomcat will interpret this request parameter as `FlÃ¡vio JosÃ©` in my servlet/JSP, when it should interpret it as `Flávio José`. (Tomcat correctly decodes the octets when they appear in a query parameter rather than a POST parameter.)

Now it may be that the FAQ is simply out of date; it still seems to think that encoded URI octets should not be interpreted as UTF-8, completely ignoring RFC 3986. If so, it is long out of date; RFC 3986 came out in 2005. (And indeed, Tomcat works with UTF-8 octets in URIs.) But out of date or not, the FAQ at https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 then recommends that to force Tomcat to interpret the `application/x-www-form-urlencoded` octets as UTF-8, I must set the `org.apache.catalina.filters.SetCharacterEncodingFilter` filter (in some `web.xml` file) to `UTF-8`. (I can alternatively put `<% request.setCharacterEncoding("UTF-8"); %>` in my JSP.) And sure enough, it fixes the problem.

But as discussed above, this is completely wrong: the resource character encoding of a request sent in `application/x-www-form-urlencoded` should have absolutely no bearing on how the encoded octets within that resource are decoded. They must be decoded as UTF-8, irrespective of what "character encoding" Tomcat assumes the content to be. Tomcat has updated the way it decodes URIs to support UTF-8; it is time Tomcat does the same for `application/x-www-form-urlencoded` values. The current approach is broken in the context of the modern web, and the workaround is simply wrong.
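The two-layer argument can be sketched in Java as follows. This is only an illustration of the reasoning, not Tomcat's actual parameter-parsing code; the class `FormBodyLayers` and its methods `bodyToText` and `decodeValue` are hypothetical names. The point it shows is that a urlencoded body such as `fullName=Fl%C3%A1vio+Jos%C3%A9` consists solely of ASCII bytes, so the charset used to read the body (layer 1) cannot change its text, while the charset used to interpret the percent-encoded octets (layer 2) is a separate choice.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormBodyLayers {

    /** Layer 1: turn the raw body bytes into text using the resource charset. */
    static String bodyToText(byte[] body, Charset resourceCharset) {
        return new String(body, resourceCharset);
    }

    /** Layer 2: percent-decode one value as UTF-8, independent of layer 1. */
    static String decodeValue(String encodedValue) {
        try {
            return URLDecoder.decode(encodedValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "fullName=Fl%C3%A1vio+Jos%C3%A9".getBytes(StandardCharsets.US_ASCII);
        Charset[] resourceCharsets = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
        };

        for (Charset cs : resourceCharsets) {
            // The body text is identical whichever resource charset is assumed.
            String text = bodyToText(body, cs);
            String encodedValue = text.substring(text.indexOf('=') + 1);
            System.out.println(cs + ": " + text + " -> " + decodeValue(encodedValue));
        }
    }
}
```

All three iterations read the same body text and produce the same decoded value, which is why the assumed "character encoding" of the request need not dictate how the encoded octets are interpreted.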

I also raised this at https://stackoverflow.com/q/54094982/421049 .

I would have filed a Tomcat Bugzilla issue, but the bug report form indicated I should report the problem on this list first.

Thank you in advance for your attention to this matter.

Garret Wilson
GlobalMentor, Inc.

