RE: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

Konstantin Preißer Sun, 30 Jul 2017 01:59:52 -0700

Hi Mark,

> -----Original Message-----
> From: Mark Thomas [mailto:ma...@apache.org]
> Sent: Saturday, July 29, 2017 2:56 PM
> 
>> (...)
>> 
> >Why would Tomcat want to modify static files, instead of just serving
> >them as-is?
> 
> Because Tomcat now checks the response encoding and the file encoding
> and converts if necessary.
> 
> You probably want to set the fileEncoding init param of the Default servlet to
> UTF-8.


Thanks. So I set the following parameter in web.xml:
        <init-param>
            <param-name>fileEncoding</param-name>
            <param-value>utf-8</param-value>
        </init-param>

The result now is that Tomcat converts the static file without a BOM from
UTF-8 to ISO-8859-1, which means the JavaScript files included by my HTML page
are still broken, as the browser expects them to be UTF-8-encoded ...
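
To illustrate the breakage, here is a minimal sketch in Python. It only models
the conversion described above (decode as UTF-8, re-encode as ISO-8859-1) and
is not Tomcat's actual code; the function names and the sample string are
made up for this example.

```python
# Hypothetical sample text containing one non-ASCII character (U+00DF).
original = "Prei\u00dfer"


def serve_converted(file_bytes: bytes) -> bytes:
    """Model of the conversion: read the file as UTF-8, send it as ISO-8859-1."""
    return file_bytes.decode("utf-8").encode("iso-8859-1")


def browser_decode_as_utf8(body: bytes) -> str:
    """Model of a browser that expects UTF-8; invalid bytes become U+FFFD."""
    return body.decode("utf-8", errors="replace")


file_bytes = original.encode("utf-8")   # the static file on disk
sent = serve_converted(file_bytes)      # what goes over the wire after conversion
seen = browser_decode_as_utf8(sent)     # what the user ends up with

print(sent)        # the sharp s is now the single byte 0xDF
print(repr(seen))  # that byte is not valid UTF-8 on its own
```

Serving the bytes as-is (passing file_bytes straight to
browser_decode_as_utf8) gives back the original string unchanged.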

I honestly don't understand that change. As a web developer, I expect a web 
server to serve static files exactly as-is, without trying to convert the files 
into another charset and without trying to detect the charset of the file 
(unless the server is configured to do so).

Bug 49464 [1] mentions that "As per spec the encoding of the page is asssumed 
to be iso-8859-1.". Do I understand correctly that this refers to the following 
section "3.7.1 Canonicalization and Text Defaults" of RFC2616?

    (...) 
   The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP.


But note that RFC7231 says in "Appendix B.  Changes from RFC 2616":

   The default charset of ISO-8859-1 for text media types has been
   removed; the default is now whatever the media type definition says.
   Likewise, special treatment of ISO-8859-1 has been removed from the
   Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)


I found the following page that discusses this change [2] and mentions RFC6657
[3], which describes text/* media type registrations with charset handling.

While RFC6657 seems to indicate that the default charset of text/plain is 
US-ASCII (which is not what browsers do), it doesn't seem to indicate a default 
charset for other types like text/html, text/javascript, application/javascript 
etc.

Browsers (I tested with IE, Firefox and Chrome) already handle the encoding of 
text-based files where the Content-Type doesn't specify a charset as the user 
would expect:
- For example, with text/html files that don't contain a BOM, they will respect 
the <meta charset=...> element. If a UTF-8 BOM is present, they will interpret 
it as UTF-8.
- If you directly open text/plain, text/css, application/javascript files in a
browser, they will check if the file has a UTF-8 BOM, and interpret it as
UTF-8 in that case; otherwise, they seem to interpret it as
ISO-8859-1/Windows-1252 (or maybe using the default system encoding, I'm not
exactly sure about that).
- However, if such files (.css and .js) are referenced by an HTML file, browsers
will interpret them in the same encoding as the HTML file (if they don't have
a BOM), which means if the HTML uses UTF-8, they will interpret .js and .css
also as UTF-8 (unless the HTML element uses a charset parameter, e.g. <script
src="script.js" charset="windows-1252"></script>).
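
The behavior listed above can be summarized as a small decision function. This
is only a simplified model of what I observed when the Content-Type carries no
charset, not the browsers' real algorithm; the function name, its parameters,
and the windows-1252 fallback are assumptions made for this sketch.

```python
import codecs
from typing import Optional


def pick_encoding(
    body: bytes,
    attr_charset: Optional[str] = None,       # e.g. <script charset="...">
    document_encoding: Optional[str] = None,  # encoding of the referencing HTML
    fallback: str = "windows-1252",           # assumed default when opened directly
) -> str:
    """Return the encoding a browser would apply, per the observations above."""
    # 1. A UTF-8 BOM wins in every case described above.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. An explicit charset attribute on the referencing element.
    if attr_charset:
        return attr_charset.lower()
    # 3. Otherwise the file inherits the encoding of the HTML page.
    if document_encoding:
        return document_encoding.lower()
    # 4. Opened directly, without a BOM: fall back to the legacy default.
    return fallback


# A BOM-less script referenced from a UTF-8 page is read as UTF-8.
print(pick_encoding(b"alert('x');", document_encoding="UTF-8"))
```

The point of the model is step 3: a BOM-less .js file only decodes correctly
if its bytes reach the browser unchanged.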

Therefore, I don't see why Tomcat would want to convert static resources to 
other encodings. (I think it should also not try to detect the charset of files 
and then include a "; charset=..." parameter in the Content-Type, as this may 
override the browser's behavior and thus also lead to incorrect decoding of 
JavaScript files that are encoded with UTF-8 without a BOM).


Further, as a system administrator, I would expect that I can update Tomcat
from x.y.z to x.y.(z+n) without static JavaScript files suddenly getting
broken. (The breakage isn't immediately obvious, because the script itself
mostly still works; only some string characters outside of ASCII are displayed
incorrectly to the user.)
Shouldn't such behavior changes be reserved for the next major/minor version
which is not yet stable, in this case Tomcat 9.0.0?


Thanks!

Regards,
Konstantin Preißer


[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=49464
[2] https://github.com/requests/requests/issues/2086
[3] https://tools.ietf.org/html/rfc6657



