Shanti Suresh wrote:
Hi Chris,

This is such an interesting discussion.  I am not sure what to make of this
person's comment:

-------------------
TAXI   2012-10-09 09:03:59 PDT

Wow, no fix since 8 years...

And this is a real bug: If the HTTP header says the file is encoded in
ISO-8859-1 the common way to override this with HTML is:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Firefox reads the body in UTF-8 then, which is fine, but the charset
used in forms is still ISO-8859-1, so you have to add
accept-charset="utf-8" to the form just for firefox (other browser
automatically use UTF-8 or send the charset with the content-type).

So: Why the hell is nobody fixing this bug?
---------------


So the questions I have are:
(1) Firefox is not properly sending UTF-8 in the POST request even though it
reads the HTML page as UTF-8?  And other browsers are now sending
"charset=utf-8" based on the HTML META tag?
(2) Firefox has started respecting the accept-charset="utf-8" attribute in
forms now, such that it adds the charset to the Content-Type header of the POST
request?   I'm confused.  I thought Mozilla was not going to fix this
issue.

Thanks for any clarifications.


I think that you are still confused... :-)
(As are, in part, some of the people who posted on that Mozilla bug).

(1) browsers, in general, are *not* sending a "charset" attribute in their POST submissions (whether form-url-encoded or multipart). This is a real pity, because it is the source of much confusion, and the real reason why servers have to jump through hoops to figure out (or force) the character set/encoding of the data that they receive from browser POSTs. The Mozilla people seem to say that it is that way because, when they tried to add this "charset" attribute, it broke a number of server applications at the time (8 years ago); they see no reason to think it would not still be the same today, so they are not trying it again.
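To illustrate why the missing "charset" attribute forces the server to guess: the exact same POSTed bytes decode to different strings depending on which encoding the server assumes. A minimal sketch (plain Java, not Tomcat-specific; the byte values are simply the UTF-8 encoding of "é"):

```java
import java.nio.charset.StandardCharsets;

public class CharsetAmbiguity {
    public static void main(String[] args) {
        // The UTF-8 bytes for "é" (U+00E9) are 0xC3 0xA9.
        byte[] posted = "é".getBytes(StandardCharsets.UTF_8);

        // With no charset attribute on the POST, the server must guess
        // how to decode these bytes -- and the two guesses disagree:
        String asUtf8   = new String(posted, StandardCharsets.UTF_8);      // "é"
        String asLatin1 = new String(posted, StandardCharsets.ISO_8859_1); // "Ã©"

        System.out.println(asUtf8);    // é
        System.out.println(asLatin1);  // Ã©
    }
}
```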

(1a) what browsers *will* do, in general, is send POST data in the same character set/encoding as that of the HTML *page* which contains the form being posted. But even when sending UTF-8 encoded data according to this principle, they do *not* indicate that it is UTF-8 data, which is basically wrong: the default HTTP/HTML character set is iso-8859-1, so they *should* indicate the charset whenever that is not what they are sending. But that is the reality.

(2) the "accept-charset" attribute of a <form> does not mean that the <form> will *send* data in that charset/encoding. It indicates that any data entered in the form's input boxes will be interpreted as being in that charset. So adding an "accept-charset" attribute to your <form> tags does not magically change the browser's behaviour when POSTing data.
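Since the browser will not label the data, a workaround often discussed on this list is to re-decode on the server side when you know the browser actually sent UTF-8 but the container decoded the parameter with its iso-8859-1 default. A sketch, with a hypothetical helper name (`recover`) and the assumption that the container used ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class RecoverUtf8 {
    // Hypothetical helper: undo an ISO-8859-1 mis-decode of data
    // that the browser actually sent as UTF-8. Getting the bytes back
    // with ISO-8859-1 is lossless, so we can re-decode them correctly.
    static String recover(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromContainer = "Ã©"; // UTF-8 bytes of "é" read as iso-8859-1
        System.out.println(recover(fromContainer)); // é
    }
}
```

(In a servlet, the cleaner fix is to call request.setCharacterEncoding("UTF-8") before reading any parameter, so the container decodes correctly in the first place.)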

In other words, it's a mess. The mess is mainly due to a lack of precision in the original RFCs, but it is being perpetuated now by browser developers' fear of breaking server applications by doing things right. Which is rather funny, in a way, considering all the things browser developers do all the time that do break existing applications.

We really need an RFC for HTTP 2.0, with UTF-8 as the default charset/encoding.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org