[ 
https://issues.apache.org/jira/browse/HTTPCLIENT-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Osipov updated HTTPCLIENT-2159:
---------------------------------------
    Description: 
Based on [~reschke]'s 
[comment|https://issues.apache.org/jira/browse/HTTPCLIENT-2144?focusedCommentId=17310053&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17310053],
we are treating several content types incorrectly. 
{{org.apache.hc.core5.http.ContentType}} defines several content types that 
are by definition UTF-8, do not have a {{charset}} parameter, or use another 
form of transport encoding. Affected are:

{code}
    public static final ContentType APPLICATION_FORM_URLENCODED = create(
            "application/x-www-form-urlencoded", StandardCharsets.ISO_8859_1);
    public static final ContentType APPLICATION_JSON = create(
            "application/json", StandardCharsets.UTF_8);
    public static final ContentType APPLICATION_NDJSON = create(
            "application/x-ndjson", StandardCharsets.UTF_8);
    public static final ContentType APPLICATION_PDF = create(
            "application/pdf", StandardCharsets.UTF_8);
    public static final ContentType APPLICATION_PROBLEM_JSON = create(
            "application/problem+json", StandardCharsets.UTF_8);
    public static final ContentType MULTIPART_FORM_DATA = create(
            "multipart/form-data", StandardCharsets.ISO_8859_1);
    public static final ContentType MULTIPART_MIXED = create(
            "multipart/mixed", StandardCharsets.ISO_8859_1);
    public static final ContentType MULTIPART_RELATED = create(
            "multipart/related", StandardCharsets.ISO_8859_1);
    public static final ContentType TEXT_HTML = create(
            "text/html", StandardCharsets.ISO_8859_1);
    public static final ContentType TEXT_EVENT_STREAM = create(
            "text/event-stream", StandardCharsets.UTF_8);
{code}

* {{application/x-www-form-urlencoded}}: Does not have a charset parameter: 
https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded. 
The URL standard (https://url.spec.whatwg.org/#urlencoded-serializing) defines 
how to apply an alternative encoding, but UTF-8 is the default.
* {{application/json}}, {{application/x-ndjson}}, {{application/problem+json}}: 
There is no charset definition because JSON is *always* UTF-8. The charset 
parameter has no meaning: 
https://datatracker.ietf.org/doc/html/rfc8259#section-11
* {{application/pdf}}: This is a binary format and has no charset.
* {{text/event-stream}}: Defined *always* as UTF-8: 
https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events-intro
* {{text/html}}: https://html.spec.whatwg.org/ does not define ISO-8859-1 as 
the default encoding. It says that the encoding must be supplied by some means 
and that an algorithm is applied to determine it. It seems that UTF-8 is 
expected these days.
* {{multipart/mixed}}: Does not have a charset parameter; it is up to the parts 
to supply the proper encoding for byte-to-char conversion: 
https://datatracker.ietf.org/doc/html/rfc2046
* {{multipart/related}}: Does not have a charset parameter; it is up to the 
parts to supply the proper encoding for byte-to-char conversion: 
https://datatracker.ietf.org/doc/html/rfc2387
* {{multipart/form-data}}: Does not have a charset parameter; the RFC defines a 
{{_charset_}} form field for that: 
https://datatracker.ietf.org/doc/html/rfc7578#section-4.6
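
To illustrate why a wrong default matters, here is a minimal sketch using only 
the JDK. It writes text as UTF-8 and reads the bytes back as ISO-8859-1, which 
is roughly what a receiver does when it trusts an ISO-8859-1 default for a 
payload that is actually UTF-8. The class and method names are hypothetical 
and not part of HttpCore or HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HttpCore or HttpClient.
final class CharsetMismatchDemo {

    private CharsetMismatchDemo() {
    }

    // Encodes text with one charset and decodes the resulting bytes with
    // another, mimicking a receiver that assumes the wrong default.
    static String reinterpret(final String text, final Charset written, final Charset assumed) {
        return new String(text.getBytes(written), assumed);
    }

    public static void main(final String[] args) {
        final String original = "\u00e4"; // LATIN SMALL LETTER A WITH DIAERESIS
        final String garbled = reinterpret(
                original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // UTF-8 encodes U+00E4 as two bytes, so the ISO-8859-1 view has two chars.
        System.out.println(original.length() + " char became " + garbled.length() + " chars");
    }
}
```

Decoding with the charset that was used for writing returns the original text 
unchanged; only the mismatched pair corrupts it.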

{{charset}} applies to the transport layer only and never to the semantics of 
the content type, e.g., {{application/x-www-form-urlencoded}}.
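
The list above could be captured in a small lookup, sketched here with the JDK 
only. This is an illustration under assumptions, not a proposed patch: 
{{MediaTypeCharsets}} and {{impliedCharset}} are hypothetical names, treating 
{{application/x-www-form-urlencoded}} as UTF-8 follows the "UTF-8 is the 
default" reading above, and {{text/html}} is left without an implied charset 
because its encoding has to be determined by other means.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; not part of HttpCore or HttpClient.
final class MediaTypeCharsets {

    private MediaTypeCharsets() {
    }

    // Returns the charset fixed by the media type definition itself, or empty
    // when the type is binary, delegates the encoding to its parts, or needs
    // the encoding to be determined by other means.
    static Optional<Charset> impliedCharset(final String mimeType) {
        final String type = mimeType.trim().toLowerCase(Locale.ROOT);
        switch (type) {
            case "application/json":
            case "application/x-ndjson":
            case "application/problem+json":
            case "text/event-stream":
            case "application/x-www-form-urlencoded": // assumption: UTF-8 default
                return Optional.of(StandardCharsets.UTF_8);
            default:
                // application/pdf, multipart/*, text/html and anything unknown
                return Optional.empty();
        }
    }

    public static void main(final String[] args) {
        final String[] samples = {
            "application/json", "application/pdf", "multipart/form-data", "text/html"
        };
        for (final String sample : samples) {
            System.out.println(sample + " -> " + impliedCharset(sample));
        }
    }
}
```

None of these types would carry a {{charset}} parameter on the wire in this 
sketch; the lookup only tells a caller how to convert bytes to characters.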

  was:
Based on [~reschke]'s, 
[comment|https://issues.apache.org/jira/browse/HTTPCLIENT-2144?focusedCommentId=17310053&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17310053].
 We are treating several content types incorrectly. We have in 
{{org.apache.hc.core5.http.ContentType}} several content types defined which 
are per definition UTF-8 and do not contain any {{charset}} parameter or have 
another form transport encoding. Affected are:

{code}
    public static final ContentType APPLICATION_FORM_URLENCODED = create(
            "application/x-www-form-urlencoded", StandardCharsets.ISO_8859_1);
    public static final ContentType APPLICATION_JSON = create(
            "application/json", StandardCharsets.UTF_8);
    public static final ContentType APPLICATION_NDJSON = create(
            "application/x-ndjson", StandardCharsets.UTF_8);
    public static final ContentType APPLICATION_PDF = create(
            "application/pdf", StandardCharsets.UTF_8);
    public static final ContentType APPLICATION_PROBLEM_JSON = create(
            "application/problem+json", StandardCharsets.UTF_8);
    public static final ContentType MULTIPART_FORM_DATA = create(
            "multipart/form-data", StandardCharsets.ISO_8859_1);
    public static final ContentType MULTIPART_MIXED = create(
            "multipart/mixed", StandardCharsets.ISO_8859_1);
    public static final ContentType MULTIPART_RELATED = create(
            "multipart/related", StandardCharsets.ISO_8859_1);
    public static final ContentType TEXT_HTML = create(
            "text/html", StandardCharsets.ISO_8859_1);
    public static final ContentType TEXT_EVENT_STREAM = create(
            "text/event-stream", StandardCharsets.UTF_8);
{code}

* {{application/x-www-form-urlencoded}}: Does not have a charset parameter: 
https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded. 
HTML5 defines https://url.spec.whatwg.org/#urlencoded-serializing how to apply 
alternative encoding, but UTF-8 is standard.
* {{application/json}}, {{"application/x-ndjson}}, 
{{application/problem+json}}: There is not charset definition because JSON is 
*always* UTF-8. The charset paremeter has no meaning: 
https://datatracker.ietf.org/doc/html/rfc8259#section-11
* {{application/pdf}}: This is binary encoding, no charset
* {{text/event-stream}}: Defined *always* as UTF-8: 
https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events-intro
* {{text/html}} https://html.spec.whatwg.org/ does not define ISO-8859-1 to be 
a default encoding. it says that encoding must be supplied by some means and an 
algorithm is applied to find it. It seems that UTF-8 is expected these days.
* {{multipart/mixed}}: Does not have a charset parameter, it is up to the parts 
to supply proper encoding to perform byte-to-char conversion: 
https://datatracker.ietf.org/doc/html/rfc2046
* {{multipart/related}}: Does not have a charset parameter, it is up to the 
parts to supply proper encoding to perform byte-to-char conversion: 
https://datatracker.ietf.org/doc/html/rfc2387
* {{multipart/form-data}}: Does not have a charset parameter, the RFC defines a 
{{_charset_}} form field for that: 
https://datatracker.ietf.org/doc/html/rfc7578#section-4.6

{{charset}} applies to the transport layer only and never to the semantics of 
the content-type. E.g., {{application/x-www-form-urlencoded}}.


> Invalid handling of charset content type parameter
> --------------------------------------------------
>
>                 Key: HTTPCLIENT-2159
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-2159
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpCache
>            Reporter: Michael Osipov
>            Priority: Major
>
> Based on [~reschke]'s 
> [comment|https://issues.apache.org/jira/browse/HTTPCLIENT-2144?focusedCommentId=17310053&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17310053],
> we are treating several content types incorrectly. 
> {{org.apache.hc.core5.http.ContentType}} defines several content types that 
> are by definition UTF-8, do not have a {{charset}} parameter, or use another 
> form of transport encoding. Affected are:
> {code}
>     public static final ContentType APPLICATION_FORM_URLENCODED = create(
>             "application/x-www-form-urlencoded", StandardCharsets.ISO_8859_1);
>     public static final ContentType APPLICATION_JSON = create(
>             "application/json", StandardCharsets.UTF_8);
>     public static final ContentType APPLICATION_NDJSON = create(
>             "application/x-ndjson", StandardCharsets.UTF_8);
>     public static final ContentType APPLICATION_PDF = create(
>             "application/pdf", StandardCharsets.UTF_8);
>     public static final ContentType APPLICATION_PROBLEM_JSON = create(
>             "application/problem+json", StandardCharsets.UTF_8);
>     public static final ContentType MULTIPART_FORM_DATA = create(
>             "multipart/form-data", StandardCharsets.ISO_8859_1);
>     public static final ContentType MULTIPART_MIXED = create(
>             "multipart/mixed", StandardCharsets.ISO_8859_1);
>     public static final ContentType MULTIPART_RELATED = create(
>             "multipart/related", StandardCharsets.ISO_8859_1);
>     public static final ContentType TEXT_HTML = create(
>             "text/html", StandardCharsets.ISO_8859_1);
>     public static final ContentType TEXT_EVENT_STREAM = create(
>             "text/event-stream", StandardCharsets.UTF_8);
> {code}
> * {{application/x-www-form-urlencoded}}: Does not have a charset parameter: 
> https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded.
> The URL standard (https://url.spec.whatwg.org/#urlencoded-serializing) 
> defines how to apply an alternative encoding, but UTF-8 is the default.
> * {{application/json}}, {{application/x-ndjson}}, 
> {{application/problem+json}}: There is no charset definition because JSON is 
> *always* UTF-8. The charset parameter has no meaning: 
> https://datatracker.ietf.org/doc/html/rfc8259#section-11
> * {{application/pdf}}: This is a binary format and has no charset.
> * {{text/event-stream}}: Defined *always* as UTF-8: 
> https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events-intro
> * {{text/html}}: https://html.spec.whatwg.org/ does not define ISO-8859-1 as 
> the default encoding. It says that the encoding must be supplied by some 
> means and that an algorithm is applied to determine it. It seems that UTF-8 
> is expected these days.
> * {{multipart/mixed}}: Does not have a charset parameter; it is up to the 
> parts to supply the proper encoding for byte-to-char conversion: 
> https://datatracker.ietf.org/doc/html/rfc2046
> * {{multipart/related}}: Does not have a charset parameter; it is up to the 
> parts to supply the proper encoding for byte-to-char conversion: 
> https://datatracker.ietf.org/doc/html/rfc2387
> * {{multipart/form-data}}: Does not have a charset parameter; the RFC defines 
> a {{_charset_}} form field for that: 
> https://datatracker.ietf.org/doc/html/rfc7578#section-4.6
> {{charset}} applies to the transport layer only and never to the semantics of 
> the content type, e.g., {{application/x-www-form-urlencoded}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
