2016-11-17 17:21 GMT+03:00 Christopher Schultz <ch...@christopherschultz.net>:
> All,
>
> I've got a problem with a vendor and I'd like another opinion just to
> make sure I'm not crazy. The vendor and I have a difference of opinion
> about how a character should be encoded in an HTTP POST request.
>
> The vendor's API officially should accept requests in UTF-8 encoding.
> We are using application/x-www-form-urlencoded content type.
>
> I'm trying to send a message with a non-ASCII character -- for
> example, a ® (that's (R), the registered trademark symbol).
>
> The Java code being used to package-up this POST looks something like
> this:
>
> OutputStream out = httpurlconnection.getOutputStream();
> out.print("notetext=");
> out.print(URLEncoder.encode("test®", "UTF-8"));
> out.close();
>
> So the POST payload ends up being notetext=test%C2%AE or, on the wire,
> the bytes are 6e 6f 74 65 74 65 78 74 3d 74 65 73 74 25 43 32 25 41 45.
>
> The final bytes 25 43 32 25 41 45 are the characters % C 2 % A E.
>
> Can someone verify that I'm encoding everything correctly?
>
> The vendor is claiming that ® can be sent "directly" like one might do
> using curl:
>
> $ curl -d 'notetext=®' [url]
>
> and the bytes on the wire are 6e 6f 74 65 74 65 78 74 3d c2 ae (note
> that c2 and ae are "bare" and not %-encoded).

1. That is the wrong way to use curl. The manual says that the argument
to -d must already be properly urlencoded, and the value above is
not.

https://curl.haxx.se/docs/manual.html
See "POST (HTTP)" and below.

2. If you are submitting data programmatically, I wonder why you are
using a bare "application/x-www-form-urlencoded".

I think it would be better to add an explicit charset parameter to the
Content-Type value, as that is easy to do with Java clients.
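
A minimal sketch of what I mean, assuming HttpURLConnection is the
client. The class name, the helper name and the example.invalid URL are
made up for illustration, and whether a given server honours the charset
parameter is something you would have to check with the vendor. Nothing
is sent here; the connection is only prepared.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeDemo {
    // Content-Type value with an explicit charset parameter.
    public static final String FORM_UTF8 =
            "application/x-www-form-urlencoded; charset=UTF-8";

    // Prepares (but does not open) a POST connection with that header.
    public static HttpURLConnection prepare(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we intend to write a request body
        conn.setRequestProperty("Content-Type", FORM_UTF8);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint; no network traffic happens in this sketch.
        HttpURLConnection conn = prepare(new URL("http://example.invalid/api"));
        System.out.println(conn.getRequestProperty("Content-Type"));
    }
}
```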

3. The application/x-www-form-urlencoded encoding was originally
specified in the HTML specification.

Current specification:
https://www.w3.org/TR/html51/sec-forms.html#urlencoded-form-data

It defers details to
https://url.spec.whatwg.org/#concept-urlencoded-serializer


Historic, HTML 4.01:
https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1


My opinion is that the correct value on the wire is
25 43 32 25 41 45 = % C 2 % A E.
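
If you want to double-check those bytes yourself, something like the
following should do. It is only a sketch: the class and method names are
mine, and the (R) sign is written as \u00AE so that the result does not
depend on the source file encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncodeDemo {
    // Percent-encodes a form value using UTF-8, as in the quoted code.
    public static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Renders the bytes of an ASCII string as space-separated hex.
    public static String toHex(String ascii) {
        StringBuilder sb = new StringBuilder();
        for (byte b : ascii.getBytes(StandardCharsets.US_ASCII)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = encodeValue("test\u00AE"); // \u00AE is the (R) sign
        System.out.println(encoded);        // test%C2%AE
        System.out.println(toHex(encoded)); // 74 65 73 74 25 43 32 25 41 45
    }
}
```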


If a vendor accepts non-encoded "c2 ae":
it may technically work (in some versions of some software), but it
is not a standard feature and it is better not to rely on it.

Technically, if non-encoded bytes ("c2 ae") are accepted, they cannot
be confused with the special characters ("=", "&", "+", "%", CRLF), as
every byte of a multi-byte UTF-8 sequence has the 0x80 bit set.
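
The high-bit property is easy to check. A rough sketch (the names are
hypothetical, nothing here comes from the vendor or from Tomcat): it
encodes a code point as UTF-8 and tests every resulting byte against
0x80. The special characters listed above are all ASCII, so each of them
is a single byte with that bit clear.

```java
import java.nio.charset.StandardCharsets;

public class Utf8HighBitDemo {
    // True if every UTF-8 byte of the given code point has the 0x80 bit set.
    public static boolean allBytesHaveHighBit(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allBytesHaveHighBit(0x00AE)); // (R) sign, bytes c2 ae
        System.out.println(allBytesHaveHighBit('='));    // ASCII, byte 3d

        // Exhaustive check over all non-ASCII code points.
        boolean ok = true;
        for (int cp = 0x80; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue; // lone surrogates are not encodable characters
            }
            ok &= allBytesHaveHighBit(cp);
        }
        System.out.println(ok);
    }
}
```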


4. Your code fragment is broken and won't compile: there are no
"print" methods in java.io.OutputStream.

OutputStream works with byte[] and the method name is "write".
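
For reference, one way the fragment could be written so that it
compiles. This is a sketch and not necessarily what your real code does:
the class and method names are invented, and a ByteArrayOutputStream
stands in for httpurlconnection.getOutputStream(). The urlencoded body is
pure ASCII, so converting it to bytes as US-ASCII is safe.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FormBodyWriter {
    // Writes "notetext=<percent-encoded value>" to the stream as bytes.
    public static void writeNoteText(OutputStream out, String value)
            throws IOException {
        String body = "notetext=" + URLEncoder.encode(value, "UTF-8");
        out.write(body.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the connection's output stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeNoteText(out, "test\u00AE"); // \u00AE is the (R) sign
        out.close();
        System.out.println(
                new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```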


5. Wikipedia:
https://en.wikipedia.org/wiki/Percent-encoding#The_application.2Fx-www-form-urlencoded_type

Wikipedia mentions XForms spec,
-> https://www.w3.org/TR/2007/REC-xforms-20071029/#serialize-urlencode

6. You can test with real browsers.

Best regards,
Konstantin Kolinko
