Konstantin,

On 11/17/16 4:58 PM, Konstantin Kolinko wrote:
> 2016-11-17 17:21 GMT+03:00 Christopher Schultz
> <ch...@christopherschultz.net>:
>> All,
>>
>> I've got a problem with a vendor and I'd like another opinion
>> just to make sure I'm not crazy. The vendor and I have a
>> difference of opinion about how a character should be encoded in
>> an HTTP POST request.
>>
>> The vendor's API officially should accept requests in UTF-8
>> encoding. We are using application/x-www-form-urlencoded content
>> type.
>>
>> I'm trying to send a message with a non-ASCII character -- for
>> example, a ® (that's (R), the registered trademark symbol).
>>
>> The Java code being used to package-up this POST looks something
>> like this:
>>
>> OutputStream out = httpurlconnection.getOutputStream();
>> out.print("notetext="); out.print(URLEncoder.encode("test®",
>> "UTF-8")); out.close();
>>
>> So the POST payload ends up being notetext=test%C2%AE or, on the
>> wire, the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43
>> 32 25 41 45.
>>
>> The final bytes 25 43 32 25 41 45 are the characters % C 2 % A
>> E.
>>
>> Can someone verify that I'm encoding everything correctly?
>>
>> The vendor is claiming that ® can be sent "directly" like one
>> might do using curl:
>>
>> $ curl -d 'notetext=®' [url]
>>
>> and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae
>> (note that c2 and ae are "bare" and not %-encoded).
>
> 1. That is a wrong way to use curl. The manual says that the
> argument to -d should be properly urlencoded. The above value is an
> incorrect one.
>
> https://curl.haxx.se/docs/manual.html See "POST (HTTP)" and below.

+1

The curl manual says that -d is the same as --data-ascii, which is
totally wrong here if they are accepting UTF-8.

> 2. If you are submitting data programmatically, I wonder why you
> are using simple "application/x-www-form-urlencoded".
>
> I think it would be better to use explicit charset argument in the
> Content-Type value, as it is easy to do so with Java clients.

Their API expects application/x-www-form-urlencoded. Everything else
they do is in JSON... I have no idea why they don't accept JSON as
input, but that's the deal. MIME types that aren't text/* aren't
supposed to have Content-Type parameters.

> 3. The application/x-www-form-urlencoded encoding was originally
> specified in HTML specification.
>
> Current specification:
> https://www.w3.org/TR/html51/sec-forms.html#urlencoded-form-data
>
> It defers details to
> https://url.spec.whatwg.org/#concept-urlencoded-serializer
>
> Historic, HTML 4.01:
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

All true, but the spec argues with itself over the character
encoding, and browsers make this worse with their stupid "I'll use
whatever character encoding was used to load the page containing the
form" behavior.

With a software-client API, there basically is no spec. Their
assertion is that their character encoding "is UTF-8". But it looks
like they aren't doing it right.

> My opinion is that the correct value on the wire is 25 43 32 25 41
> 45 = % C 2 % A E.

So, the same bytes as I had, right?

> If a vendor accepts non-encoded "c2 ae": it technically may work
> (in some versions of some software), but this is not a standard
> feature and one would better not rely on it.
>
> Technically, if non-encoded bytes ("c2 ae") are accepted, they
> won't be confused with special character ("=", "&", "+", "%",
> CRLF), as all multi-byte UTF-8 characters have 0x80 bit set.

Their non-%-encoded bytes could be considered legitimate, because the
application/x-www-form-urlencoded rules say that any character "in
the character set of the request" can be dropped into the request
without being %-encoded. But then we are back to the problem of not
knowing what the encoding of the request is.
Since UTF-8 is supposed to be the "official" character encoding, I
would expect that a properly-encoded request would contain nothing
but valid ASCII characters, which means that 0xc2 0xae need to be
%-encoded to become "%c2%ae".

> 4. You code fragment is broken and won't compile: there are none
> "print" methods in java.io.OutputStream.
>
> OutputStream works with byte[] and the method name is "write".

Yes, it was hastily-typed from memory. The true code compiles and
runs as expected.

> 5. Wikipedia:
> https://en.wikipedia.org/wiki/Percent-encoding#The_application.2Fx-www-form-urlencoded_type
>
> Wikipedia mentions XForms spec, ->
> https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

Thanks for the XForms reference... it's nice that it has a real
example (including a non-ASCII character) instead of the usual
trivial examples in the HTTP and HTML specs, for instance.

> 6. You can test with real browsers.

I will certainly be doing that.

The vendor has responded with (paraphrasing) "it seems we don't
completely follow this standard; we're considering what to do next,
which may include no change".

This is a big vendor with *lots* of software clients, so maintaining
backward compatibility is going to be a big deal for them. I've got
some tricks up my sleeve if they decide not to change anything.

Hooray for specs.
:(

Thanks,
-chris
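
For reference, here is a minimal Java sketch of the %-encoding discussed in this thread. It is not the real client code: the class name FormEncodingSketch and the helpers formBody and hex are made up for illustration, and it only builds the request body and dumps its bytes rather than opening a connection. The only JDK calls used are URLEncoder.encode(String, String) and String.getBytes(Charset).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class FormEncodingSketch {

    // Builds a single name=value pair, %-encoding both sides as UTF-8.
    static String formBody(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Renders bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "test\u00AE" is "test" followed by the registered trademark symbol.
        String body = formBody("notetext", "test\u00AE");

        // After %-encoding the body is pure ASCII, so the charset used to
        // turn it into bytes no longer matters.
        byte[] wire = body.getBytes(StandardCharsets.US_ASCII);

        System.out.println(body);      // notetext=test%C2%AE
        System.out.println(hex(wire)); // 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43 32 25 41 45
    }
}
```

With a real connection, the same bytes would go out through OutputStream.write(byte[]) rather than a print method.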