Garret,

On 2/6/20 10:25 AM, Garret Wilson wrote:
> On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
>> …
>>> As of Tomcat 10, conf/web.xml contains the following:
>>>
>>> <!-- Set the default request and response character encodings
>>> to UTF-8. -->
>>> <request-character-encoding>UTF-8</request-character-encoding>
>>> <response-character-encoding>UTF-8</response-character-encoding>
>>>
>>>
>>> That *should* have the effect you are looking for but I confess I
>>> haven't tested it in any great detail.
>>>
>>
>> As I am sure many people (Christopher included) would agree, the
>> real solution would be for browsers and other HTTP clients to
>> indicate clearly in the request, the charset/encoding of each
>> text parameter that they are sending. There are even HTTP headers
>> already defined for that.
>
>
> Which HTTP headers are you referring to? `Content-Type`? It is my
> opinion that this is irrelevant and not applicable.
>
> As I explained (extensively) in my original post for this thread
> back on 2019-01-08, the issue is not the charset of
> `application/x-www-form-urlencoded`. That media type is made up of
> ASCII characters. It doesn't matter whether you say it's ASCII,
> ISO-8859-1, UTF-8, or whatever, the actual characters stay 100% the
> same.

Hmm. Not always. While it may be true that:

1. ASCII, ISO-8859-1, and UTF-8 are very common
2. ASCII, ISO-8859-1, and UTF-8 share the first 128 code points
   (0-127), encoded as the same single bytes

It is not true that:

3. All character encodings share the first 128 code points. UTF-16
   doesn't follow that pattern: it uses two bytes even for ASCII
   characters.

> At issue is when certain octets are encoded (as specified by the
> `application/x-www-form-urlencoded` media type itself), what
> charset to use when decoding them. This is independent of the
> encoding of the media type itself; rather this is defined by the
> specification for the format.

Correct. And there is a lack of agreement for URLs, so browsers decided
to make it up.
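
To make the two points above concrete, here is a minimal sketch in
Python (chosen only for brevity; the sample strings are made up for
illustration and do not come from this thread). It shows that
ASCII-only text has identical bytes in ASCII, ISO-8859-1, and UTF-8
but not in UTF-16, and that the same percent-encoded octets decode to
different text depending on which charset the receiver assumes.

```python
from urllib.parse import unquote

# Illustrative sample only, not taken from any real request.
ascii_text = "name=value"

# ASCII-only text is byte-for-byte identical in these three encodings,
# so the set below collapses to a single byte sequence.
same = {ascii_text.encode(cs) for cs in ("ascii", "iso-8859-1", "utf-8")}

# UTF-16 uses two bytes per character here, so its bytes differ.
utf16 = ascii_text.encode("utf-16-le")

# The percent-encoded form is plain ASCII on the wire, but the octets
# C3 A9 it stands for mean different things under different charsets.
encoded = "caf%C3%A9"
as_utf8 = unquote(encoded, encoding="utf-8")         # "café"
as_latin1 = unquote(encoded, encoding="iso-8859-1")  # "cafÃ©"
```

Nothing in `encoded` itself says which of the two results the sender
intended.
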
It's not possible to guess what the browser has chosen because it does
not advertise it in any way (absent a standard). The only 100% reliable
way to do it would be to add a parameter to every request which has a
known-correct value that can be unambiguously decoded. You just keep
re-decoding the whole URL until that parameter value matches the
known-correct value.

Sounds like a lot of fun to implement across a whole application, right?

> Unfortunately https://tools.ietf.org/html/rfc1866 actually says we
> should use ASCII when decoding the octets, but this is severely
> antiquated and doesn't fit with modern practice. The WhatWG
> essentially redefines the format to say that the octets must be
> interpreted as UTF-8:
>
> https://url.spec.whatwg.org/#application/x-www-form-urlencoded
>
> So to summarize my view:
>
> * The decoding of the `application/x-www-form-urlencoded` media
> type encoded octets is completely independent of the charset
> indicated in the `Content-Type` header, and rather goes to the
> specification of the format itself.

It's strange, because Content-Type can contain a charset parameter, but
MIME specifically says that "charset" parameters are only appropriate
for "text/*" MIME types. So for application/x-www-form-urlencoded, you
"shouldn't" add that parameter. But there's no particular reason NOT to
include it (it doesn't actually violate any spec) and adding it
COMPLETELY AND UNAMBIGUOUSLY indicates what the browser chose as the
encoding.

> * RFC 1866 is severely out of date and out of step, and the
> WhatWG's specification of the `application/x-www-form-urlencoded`
> media type should be used instead. (Modern browser practice would
> seem to agree with me.)

RFC 1866 has been very much superseded. Also, HTML specs shouldn't be
defining HTTP semantics. So ignore whatever is in RFC 1866 on multiple
grounds.

> * Therefore `web.xml` settings, HTTP headers, etc.
> are all irrelevant, as this is an issue dealing with the file format
> itself, and the latest spec for the file format says to use UTF-8,
> so everyone should use UTF-8 already.

Except for everyone who already uses something else and expects
everything to be backward-compatible.

The problem is that you don't get to declare what's "best" for everyone
and then have the whole world do what you want. I happen to agree with
you (Everyone should move to UTF-8 for everything. Everywhere.
Forever.), but you have to recognize that there is history, and there
are entrenched systems, environments, and mindsets.

> The new default `web.xml` in Tomcat 10 is a wonderful step in the
> right direction.

+1

-chris
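
For reference, the known-value parameter idea described earlier in this
message could look roughly like the Python sketch below. Everything
named here is a hypothetical choice for illustration: the
`_charset_check` parameter, its known value, and the candidate charset
list are assumptions, not anything Tomcat or a browser defines.

```python
from urllib.parse import parse_qs

# Hypothetical names and values, chosen only for this illustration.
SENTINEL_NAME = "_charset_check"
SENTINEL_VALUE = "\u00e9"  # "é": encoded differently by UTF-8 and ISO-8859-1
CANDIDATES = ("utf-8", "iso-8859-1")


def decode_query(raw_query):
    """Re-decode the whole query string under each candidate charset.

    Returns (charset, params) for the first charset under which the
    sentinel parameter decodes to its known value, or (None, None)
    if no candidate matches or the sentinel is missing.
    """
    for charset in CANDIDATES:
        params = parse_qs(raw_query, encoding=charset, errors="replace")
        if params.get(SENTINEL_NAME) == [SENTINEL_VALUE]:
            return charset, params
    return None, None


# A UTF-8 client and an ISO-8859-1 client sending the same form data.
from_utf8_client = decode_query("_charset_check=%C3%A9&city=K%C3%B6ln")
from_latin1_client = decode_query("_charset_check=%E9&city=K%F6ln")
```

Both calls recover the same `city` value, but only because every
request carries the extra parameter and the server loops over
candidates, which is exactly the per-application burden described
above. A sentinel this short also cannot tell ISO-8859-1 apart from
windows-1252, since both encode "é" as the same byte.
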