2016-11-17 17:21 GMT+03:00 Christopher Schultz <ch...@christopherschultz.net>:
> All,
>
> I've got a problem with a vendor and I'd like another opinion just to
> make sure I'm not crazy. The vendor and I have a difference of opinion
> about how a character should be encoded in an HTTP POST request.
>
> The vendor's API officially should accept requests in UTF-8 encoding.
> We are using application/x-www-form-urlencoded content type.
>
> I'm trying to send a message with a non-ASCII character -- for
> example, a ® (that's (R), the registered trademark symbol).
>
> The Java code being used to package-up this POST looks something like
> this:
>
> OutputStream out = httpurlconnection.getOutputStream();
> out.print("notetext=");
> out.print(URLEncoder.encode("test®", "UTF-8"));
> out.close();
>
> So the POST payload ends up being notetext=test%C2%AE or, on the wire,
> the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43 32 25 41 45.
>
> The final bytes 25 43 32 25 41 45 are the characters % C 2 % A E.
>
> Can someone verify that I'm encoding everything correctly?
>
> The vendor is claiming that ® can be sent "directly" like one might do
> using curl:
>
> $ curl -d 'notetext=®' [url]
>
> and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae (note
> that c2 and ae are "bare" and not %-encoded).

1. That is the wrong way to use curl. The manual says that the argument
   to -d should be properly urlencoded, and the value above is not.
   https://curl.haxx.se/docs/manual.html
   See "POST (HTTP)" and below.

2. If you are submitting data programmatically, I wonder why you are
   using plain "application/x-www-form-urlencoded". I think it would be
   better to add an explicit charset parameter to the Content-Type
   value, as that is easy to do with Java clients.

3. The application/x-www-form-urlencoded encoding was originally
   specified in the HTML specification.

   Current specification:
   https://www.w3.org/TR/html51/sec-forms.html#urlencoded-form-data
   It defers details to
   https://url.spec.whatwg.org/#concept-urlencoded-serializer

   Historic, HTML 4.01:
   https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

   My opinion is that the correct value on the wire is
   25 43 32 25 41 45 = % C 2 % A E.

   If a vendor accepts non-encoded "c2 ae", it may technically work (in
   some versions of some software), but this is not a standard feature
   and it is better not to rely on it.

   Technically, if non-encoded bytes ("c2 ae") are accepted, they won't
   be confused with the special characters ("=", "&", "+", "%", CRLF):
   those are all ASCII, while every byte of a multi-byte UTF-8 sequence
   has the 0x80 bit set.

4. Your code fragment is broken and won't compile: there are no "print"
   methods in java.io.OutputStream. OutputStream works with byte[] and
   the method name is "write".

5. Wikipedia:
   https://en.wikipedia.org/wiki/Percent-encoding#The_application.2Fx-www-form-urlencoded_type
   Wikipedia mentions the XForms spec ->
   https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

6. You can test with real browsers.

Best regards,
Konstantin Kolinko
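
P.S. A minimal sketch of how the quoted fragment could be written so
that it compiles and produces the percent-encoded bytes discussed
above. The class name FormBodyDemo and the helpers formBody, writeBody
and toHex are hypothetical names chosen for illustration, and a
ByteArrayOutputStream stands in for the stream returned by
HttpURLConnection.getOutputStream() so the example runs without a
server.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class FormBodyDemo {

    // Content-Type with an explicit charset parameter. With a real connection
    // this could be set via setRequestProperty("Content-Type", CONTENT_TYPE).
    static final String CONTENT_TYPE =
            "application/x-www-form-urlencoded; charset=UTF-8";

    // Build "name=value", percent-encoding the UTF-8 bytes of both parts.
    static String formBody(String name, String value)
            throws UnsupportedEncodingException {
        return URLEncoder.encode(name, "UTF-8")
                + "=" + URLEncoder.encode(value, "UTF-8");
    }

    // OutputStream has write(byte[]), not print(String). After encoding,
    // the body is pure ASCII, so US_ASCII is sufficient here.
    static void writeBody(OutputStream out, String body) throws IOException {
        out.write(body.getBytes(StandardCharsets.US_ASCII));
    }

    // Render bytes as space-separated lowercase hex for comparison
    // with a wire capture.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String body = formBody("notetext", "test\u00AE"); // "test" + the (R) sign
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        writeBody(wire, body);
        System.out.println(body);
        System.out.println(toHex(wire.toByteArray()));
    }
}
```

Run as is, this should print notetext=test%C2%AE followed by the hex
dump ending in 25 43 32 25 41 45, i.e. the same bytes quoted in the
original message.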