I have a question (using Tomcat 9.0.12 on Windows 10), and I'd like
someone on the Tomcat development team to clarify a distinction for me
regarding resource charsets and octet decoding in a particular format. I
am not a newbie, and although the answer to my question may seem
obvious, it goes to a critical issue that I believe to be a fundamental
bug in Tomcat encoding processing.
Let's say that as an HTTP client I retrieve a resource `readme.txt` from
Tomcat, and Tomcat clearly indicates via the HTTP response headers that
the `Content-Type` is `text/plain; charset=ISO-8859-1`. That file
contains, among other things, a line that says:
See https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9
for more info.
I want to parse the text file and present a live link to the user (email
clients do this all the time), but I want to make the link "pretty" by
decoding the URL. The question is: do I decode the octets using UTF-8,
to show `…fullName=Flávio+José`, or do I use ISO-8859-1 to decode the
octets, so that I show `…fullName=Flávio+José`? (Flávio José is a
famous Brazilian forró singer and musician, by the way.)
The content type encoding of `readme.txt` is ISO-8859-1, so I must use
ISO-8859-1 to decode the octets in `Fl%C3%A1vio+Jos%C3%A9`, yielding
`…fullName=Flávio+José`, right??!
No, of course not. The decoding of the octet sequence is independent of
the resource encoding, and represents a separate layer of encoding _on
top_ of the resource encoding. It wouldn't matter whether the text file
were encoded in UTF-8, ISO-8859-1, or US-ASCII—the URL would still be
https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9, and its
octets should still be decoded using UTF-8 as per RFC 3986.
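
To make the two candidate decodings concrete, here is a minimal Java sketch. The class name `PercentDecodeDemo` and the helper `decode` are made up for illustration, and it assumes Java 10 or later for the `URLDecoder.decode(String, Charset)` overload. Note that `URLDecoder` applies form-style decoding, so it also turns `+` into a space.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentDecodeDemo {
    // Hypothetical helper: turns %XX escapes into octets, then interprets
    // those octets in the given charset ('+' becomes a space).
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset); // overload added in Java 10
    }

    public static void main(String[] args) {
        String value = "Fl%C3%A1vio+Jos%C3%A9";
        // Octets C3 A1 and C3 A9 read as UTF-8: one character each.
        System.out.println(decode(value, StandardCharsets.UTF_8));
        // The same octets read as ISO-8859-1: two characters each.
        System.out.println(decode(value, StandardCharsets.ISO_8859_1));
    }
}
```

The charset of the file that happened to contain the URL never enters into it; only the charset chosen for the escaped octets changes the result.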
I'll get right to the point; the above was a rhetorical question used as
an analogy.
The Tomcat FAQ at
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q9 indicates that
the default encoding for an HTTP POST is ISO-8859-1. That is true.
However, Tomcat then goes further and assumes that it should decode
_the octets of `application/x-www-form-urlencoded`_ using ISO-8859-1 as
well! This is simply wrong; the octets should be interpreted as a
sequence of UTF-8 octets; see
https://url.spec.whatwg.org/#concept-urlencoded-serializer . This means
if my browser sends a POST with content `fullName=Fl%C3%A1vio+Jos%C3%A9`
using `application/x-www-form-urlencoded`, Tomcat will interpret this
request parameter as `Flávio José` in my servlet/JSP, when it should
interpret it as `Flávio José`. (Tomcat correctly decodes the octets when
they are sent as a query parameter rather than a POST parameter.)
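
The difference can be reproduced outside Tomcat with a small form-body parser. This is only a sketch: `FormBodyDemo` and `parseForm` are hypothetical names, it is not Tomcat's actual parameter-parsing code, and it assumes Java 10 or later. The charset argument stands in for whatever charset the container picks for the escaped octets.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyDemo {
    // Hypothetical helper: splits an application/x-www-form-urlencoded body
    // into name/value pairs and decodes the escaped octets with octetCharset.
    // If a name repeats, the last value wins (kept simple for illustration).
    static Map<String, String> parseForm(String body, Charset octetCharset) {
        Map<String, String> params = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return params;
        }
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, octetCharset),
                       URLDecoder.decode(value, octetCharset));
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "fullName=Fl%C3%A1vio+Jos%C3%A9";
        // What I am seeing: the octets treated as ISO-8859-1.
        System.out.println(parseForm(body, StandardCharsets.ISO_8859_1).get("fullName"));
        // What I expect: the octets treated as UTF-8.
        System.out.println(parseForm(body, StandardCharsets.UTF_8).get("fullName"));
    }
}
```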
Now it may be that the FAQ is simply out of date; it still seems to
think that encoded URI octets should not be interpreted as UTF-8,
completely ignoring RFC 3986. If so, it is long out of date; RFC 3986
came out in 2005. (And indeed, Tomcat works with UTF-8 octets in URIs.)
But out of date or not, the FAQ at
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 then recommends
that to force Tomcat to interpret the
`application/x-www-form-urlencoded` octets as UTF-8, I must set the
`org.apache.catalina.filters.SetCharacterEncodingFilter` filter (in some
`web.xml` file) to `UTF-8`. (I can alternatively put `<%
request.setCharacterEncoding("UTF-8"); %>` in my JSP.) And sure enough,
it fixes the problem.
But as discussed above, this is completely wrong: the resource character
encoding of a request sent in `application/x-www-form-urlencoded` should
have absolutely no bearing on how the encoded octets within that
resource are decoded. They must be decoded as UTF-8, irrespective of
what "character encoding" Tomcat assumes the content to be. Tomcat has
updated the way it decodes URIs to support UTF-8; it is time Tomcat does
the same for `application/x-www-form-urlencoded` values. The current
approach is broken in the context of the modern web, and the workaround
is simply wrong.
I also raised this at https://stackoverflow.com/q/54094982/421049 .
I would have filed a Tomcat Bugzilla issue, but the bug report form
indicated I should report the problem on this list first.
Thank you in advance for your attention to this matter.
Garret Wilson
GlobalMentor, Inc.