Re: Sanity Check

Christopher Schultz Mon, 21 Nov 2016 09:10:17 -0800

AndrÃ©,

:)

On 11/19/16 12:31 PM, André Warnier (tomcat) wrote:
> With respect, this is not only "André's problem".

Agreed. I apologize if it seemed like I was suggesting that you are
the only one complaining.

> I would also posit that this being an English-language forum, the 
> posters here would tend to be predominantly English-speaking
> developers, who are quite likely not the ones most affected by such
> issues. So the above numbers are quite likely to be
> unrepresentative of the number of people really affected by such
> matters.

Also agreed: we are a self-selected group. But while we are
predominantly English-speaking (even as a second or third language),
we are all serving user populations that fall outside of that realm.

For instance, my software is overwhelmingly deployed in the United
States, but we have full support for Simplified and Traditional
Chinese script (except for top-to-bottom and right-to-left rendering,
which we don't do quite yet).

So ISO-8859-1 has basically never worked for us, and we've been UTF-8
since roughly the beginning.

> And one could also look at the amount of code in applications and
> in Tomcat e.g., which is dedicated to working around linked
> issues. (Think "UseBodyEncodingForURL", 
> "org.apache.catalina.filters.AddDefaultCharsetFilter" etc.)
> 
> Basically what I'm saying is that this 
> "posted-parameters-encoding-issue" is far from being "licked",
> despite the fact that native English-speaking developers may have a
> tendency to believe that it is.

Aah, I meant that *my* problem with *this* vendor is now an
open-and-shut case: they are squarely in violation of the
specifications. They may decide not to change, but at least we know
the truth of the matter and can move forward from there.

When it's unclear which party is at fault, the party with the bigger
bank account wins. (In this case, it's the vendor who has all the
money, not me. :) But being able to show that they advertise support
for this specification and clearly do not correctly support it means
that really THEY should be making a change to their software, not me.

>> The only problem now is that it's not clear how to turn %C2%AE
>> into a character because you have to know that UTF-8 and not
>> Shift-JIS or whatever is being used.
>> 
>>> -> Required parameters : No parameters
>>> -> Optional parameters : No parameters
>>> 
>>> OK. So no charset= parameter is allowed. My advice to specify
>>> the charset parameter was wrong.
> 
> No, it wasn't, not really.  I believe that you were on a good track
> there. It is the spec that is wrong, really.
> 
> One is allowed to question a spec if it appears wrong, or ? After
> all, RFC means "Request For Comment".

Sure. The problem is that the app can only do so much, especially when
the browsers behave in a very weird way... specifically by flatly
refusing to provide a charset parameter to the Content-Type when it's
appropriate.

Being allowed (spec-wise) to include a charset along with that
Content-Type would be nice. An alternative would be to keep the spec
intact and add a new spec that introduces a new header, e.g.
Encoded-Content-Type, that would be a stand-in for the missing
"charset" parameter for application/x-www-form-urlencoded (a/xwfu).

>> Agreed: it is always against the spec(s) to specify a charset for
>> any MIME type that is not text/*.
> 
> Agreed. It just makes no sense for data that is not fundamentally
> "text". (Whether some such text data has or not a MIME type whose
> designation starts with "text/" is quite another matter. For
> example : the MIME type "application/ecmascript" refers to text
> data (javascript code) - and allows a charset attribute - even
> though its type name does not start with "text/"; there are many
> other types like that).

I think the real problem is that many application/* MIME types really
should be text/* types instead. Javascript is another good example.
a/xwfu is also, by definition, text. If you want to upload binaries,
you use application/octet-stream or multipart/form-data with a part of
type application/octet-stream.

>>> Apache Tomcat supports the use of charset parameter with 
>>> Content-Type application/x-www-form-urlencoded in POST
>>> requests.
>> 
> 
> Good for Tomcat.  That /is/ the intelligent thing to do, MIME-type 
> notwithstanding. Because if ever, clients such as standard web
> browsers would come to pay more attention and apply this attribute,
> much of the current confusion would go away.
> 
> Even better would be, if the RFC for
> "application/x-www-form-urlencoded" would be amended, to specify
> that this charset attribute SHOULD be provided, and that by default
> its value would be "ISO-8859-1" (for now; but there is a good case
> to make it UTF-8 nowadays).

Weirdly, the current behavior of web browsers is to:

a) Use the charset of the page that presented the form
and
b) Not report it to the server when submitting the POST request

So everybody loses, and you can't just claim "the standard should be
X". The standard default should be "undefined" :)
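
To make the server's position concrete, here is a rough Java sketch of
the only thing it can do with that header: use the charset parameter if
the client bothered to send one, and otherwise fall back to a
configured guess. The class and method names (FormCharset, resolve) are
made up for illustration; this is not Tomcat's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class FormCharset {

    /**
     * Returns the charset named in a Content-Type header value, or the
     * supplied fallback when the parameter is missing or unusable.
     */
    static Charset resolve(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // parts[0] is the media type; the rest are parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Parameter values may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as absent.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // What browsers actually send: no charset, so the fallback wins.
        System.out.println(resolve("application/x-www-form-urlencoded",
                StandardCharsets.UTF_8));
        // What we would like them to send.
        System.out.println(resolve(
                "application/x-www-form-urlencoded; charset=ISO-8859-1",
                StandardCharsets.UTF_8));
    }
}
```

In a servlet, the equivalent move is to call
request.setCharacterEncoding(...) before the first parameter is read,
which is still only a guess when the client sent nothing.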

> In fact, if Tomcat was to strictly respect the MIME type definition
> of "application/x-www-form-urlencoded" and thus, after
> percent-decoding the POST body, interpret any byte of the resulting
> string strictly as being a character in the US-ASCII character set,
> that /would/ instantly break thousands of applications.

It would break everything, and I don't think it would be a "strict"
following of the spec. There is a hole in the spec because the server
can't (per spec) know the intended character encoding of the text
after it has been url-decoded.

I'm saying that the a/xwfu raw body itself must be (per spec)
US-ASCII. But once url-decoded, those bytes can be interpreted as
pretty much anything, UTF-8 being the most sensible these days, but
evidently ISO-8859-1 gets used a lot. Hence your Andr†® problem.
Again, not YOUR problem. :)
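
A small Java illustration of that point, using only
java.net.URLDecoder: the same pure-ASCII body yields different text
depending on which charset the server guesses for the decoded bytes.
The class name PercentDecodeDemo is just for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PercentDecodeDemo {

    /** Percent-decodes an ASCII token and reads the bytes in the given charset. */
    static String decode(String ascii, Charset charset) {
        try {
            return URLDecoder.decode(ascii, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: the name comes from an existing Charset.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "Andr%C3%A9"; // US-ASCII on the wire

        // Bytes C3 A9 read as UTF-8 are one character (U+00E9).
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);
        // The same bytes read as ISO-8859-1 are two characters (U+00C3 U+00A9).
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());   // 5
        System.out.println(asLatin1.length()); // 6
    }
}
```

Neither result is "wrong" as far as the decoder is concerned; only the
sender knows which one was intended.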

> it would now seem (unless I misinterpret, which is a distinct 
> possibility) that the content of a
> "application/x-www-form-urlencoded" POST, *after*
> URL-percent-decoding, *may* be a UTF-8 encoded Unicode string (it
> may also be something else). (There is even a provision for
> including a hidden "_charset_" parameter naming the
> charset/encoding. Yet another muddle ?) (This also applies only to
> HTML 5 <form> documents, but let's skip this for a moment).
> 
> Still, as far as I can tell, there is still to some extent the
> same "chicken-and-egg" problem, in the sense that in order to parse
> the above parameter, one would first have to decode the 
> "application/x-www-form-urlencoded" POST body, using some character
> set. For which one would need to know ditto character set before
> decoding.

The _charset_ thing is a horrible hack. It's worse than XML: at least
an XML parser knows that the encoding declaration it's looking for is
at the very beginning of the stream. There's no requirement that the
_charset_ parameter, for example, be the first parameter sent in the
body of the request. :(
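
Because _charset_ can appear anywhere, a server that wants to honor it
has to look at the body twice. A minimal sketch of that, assuming the
parameter name and its value are plain ASCII (the class name
CharsetHintParser and the two-pass approach are my own illustration,
not what any particular container does):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class CharsetHintParser {

    static String decode(String token, Charset charset) {
        try {
            return URLDecoder.decode(token, charset.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // name comes from a real Charset
        }
    }

    /** Parses a form body, honoring a _charset_ field wherever it appears. */
    static Map<String, String> parse(String body, Charset fallback) {
        String[] pairs = body.split("&");

        // Pass 1: find the hint without decoding anything else.
        Charset charset = fallback;
        for (String pair : pairs) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("_charset_")) {
                try {
                    charset = Charset.forName(
                            decode(pair.substring(eq + 1), StandardCharsets.US_ASCII));
                } catch (IllegalArgumentException e) {
                    // Unknown or illegal name: keep the fallback.
                }
                break;
            }
        }

        // Pass 2: decode every pair with the charset chosen above.
        // Repeated names overwrite earlier ones in this simplified sketch.
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : pairs) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name, charset), decode(value, charset));
        }
        return params;
    }

    public static void main(String[] args) {
        // The hint arrives last, after the value it is supposed to describe.
        Map<String, String> p =
                parse("name=Andr%E9&_charset_=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(p.get("name").length()); // 5
    }
}
```

The second pass is the price of the hint not being guaranteed to come
first, and the whole body has to be buffered before any value can be
trusted.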

> Pretty much the same solution applies to POSTs in the 
> "multipart/form-data" format, where each posted parameter already
> has its own section with a MIME header.  Whenever one of these
> parameters is text, it should specify a charset. (And if it
> doesn't, then the current muddle applies).

The problem is that most of these parts don't have a text/* MIME type.
That's what I meant when I said you've "moved the problem" because
a/xwfu can still hide in there and nothing has been solved.

> The only remaining muddle is with the parameters passed inside the
> URL, as a query-string.

+1

> But for those, one could apply for example the same mechanism as
> is already applied for non-ASCII email header values (see 
> https://tools.ietf.org/html/rfc2047). This is not really ideal in
> terms of simplicity, but 1) the code exists and works and 2) it
> would certainly be preferable to the current muddled situation and
> recurrent parameter encoding problems. (And again, for clients
> which do not use this, then the current muddle applies).

UTF-8 is pretty much the agreed-upon standard these days, except where
it isn't :)

> Altogether, to me it looks like there are 2 bodies of experts, one
> on the HTML-and-client side and one on the HTTP-and-webserver side
> (or maybe these are 4 bodies), who have not really been talking to
> each other constructively on this issue for years.

Yes and, oddly enough, they are all working under the W3C umbrella.

> The result being that instead of agreeing on some simple rules,
> each one of them kind of patched together its own separate set of
> rules (and a lot of complex software), to obtain finally something
> which still does not really solve the interoperability problem
> fundamentally.
> 
> The current situation is nothing short of ridiculous:
> - there are many character sets/encodings in use, but most/all of
>   them are clearly defined and named
> - there are millions of webservers, and billions of web clients
> But fundamentally:
> - currently, a client has no way to know for sure what character
>   set/encoding it should use, when it first tries to send some piece
>   of text data to a webserver
> - currently, a webserver has no way to know for sure in what
>   character set/encoding a client is sending text data to it

All true.

> I'm sure that we can do better.  But someone somewhere has to take
> the initiative.  And who better than an open-source software
> foundation whose products already dominate the worldwide webserver
> market ?

https://xkcd.com/927/

-chris
