Shanti Suresh wrote:
Hi Chris,
This is such an interesting discussion. I am not sure what to make of this
person's comment:
-------------------
TAXI 2012-10-09 09:03:59 PDT
Wow, no fix since 8 years...
And this is a real bug: If the HTTP header says the file is encoded in
ISO-8859-1 the common way to override this with HTML is:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Firefox reads the body in UTF-8 then, which is fine, but the charset
used in forms is still ISO-8859-1, so you have to add
accept-charset="utf-8" to the form just for firefox (other browser
automatically use UTF-8 or send the charset with the content-type).
So: Why the hell is nobody fixing this bug?
---------------
So the questions I have are:
(1) Firefox is not properly sending UTF-8 in the POST request even if it
reads the HTML page in UTF-8? And other browsers are now sending
"charset=utf-8" based on the the HTML META tag?
(2) Firefox has started respecting the accept-charset="utf-8" attribute in
forms now such that it adds charset to the Content-Type header of the POST
request? I'm confused. I thought Mozilla was not going to fix this
issue.
Thanks for any clarifications.
I think that you are still confused... :-)
(As are, in part, some of the people who posted on that Mozilla bug).
(1) browsers, in general, are *not* sending a "charset" attribute in the Content-Type
header of their POST submissions (whether application/x-www-form-urlencoded or
multipart/form-data).
This is a real pity, because it is the source of much confusion, and the real reason why
servers have to jump through hoops to figure out (or force) the character set/encoding of
the data that they are getting from browser POSTs.
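Concretely, on the Tomcat side the usual way out is to *force* the request encoding before
anything reads a parameter. Here is a minimal sketch of such a servlet Filter (the class
name is mine, purely for illustration; Tomcat also ships a similar
org.apache.catalina.filters.SetCharacterEncodingFilter that you can map in web.xml instead):

  import java.io.IOException;
  import javax.servlet.Filter;
  import javax.servlet.FilterChain;
  import javax.servlet.FilterConfig;
  import javax.servlet.ServletException;
  import javax.servlet.ServletRequest;
  import javax.servlet.ServletResponse;

  // Forces a known encoding on requests whose Content-Type carries no charset,
  // which, as per point (1), is what browsers send for form POSTs.
  public class ForceRequestEncodingFilter implements Filter {

      public void init(FilterConfig config) throws ServletException {
          // nothing to configure in this sketch
      }

      public void doFilter(ServletRequest request, ServletResponse response,
                           FilterChain chain) throws IOException, ServletException {
          // Must run before anything calls getParameter()/getReader(),
          // otherwise the container has already decoded the body with its default.
          if (request.getCharacterEncoding() == null) {
              request.setCharacterEncoding("UTF-8");
          }
          chain.doFilter(request, response);
      }

      public void destroy() {
          // nothing to clean up
      }
  }

The essential point is simply that *somebody* has to tell the container which encoding the
browser used, because the browser itself does not say.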
And the Mozilla people seem to say that it is that way because, when they tried to add
this "charset" attribute, it broke a number of server applications at the time (8 years
ago); they see no reason to think that it would not still be the same today, so they
are not trying it again.
(1a) what browsers *will* do, in general, is send POST data in the same character
set/encoding as that of the HTML *page* which contains the form being posted.
But, even when sending UTF-8-encoded data according to this principle, they are *not*
indicating that it is UTF-8 data, which is basically wrong, because the standard HTTP/HTML
character set is ISO-8859-1, and they *should* indicate it when that is not what they are
sending. But that is the reality.
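Seen from the servlet side, point (1a) looks like this (just a sketch, assuming the page
containing the form was served as UTF-8, that "request" is the HttpServletRequest inside
doPost(), and that the "name" parameter is only an example):

  // What the container actually receives for such a POST:
  //   request.getContentType()       -> "application/x-www-form-urlencoded"  (no charset)
  //   request.getCharacterEncoding() -> null
  // so, per the Servlet spec, parameters are decoded as ISO-8859-1 by default.
  // If the encoding was not forced up front (see the filter sketch above), the
  // classic after-the-fact repair is to undo the wrong decoding and redo it as UTF-8:
  String wrong = request.getParameter("name");  // UTF-8 bytes decoded as ISO-8859-1 -> mojibake
  String fixed = new String(wrong.getBytes("ISO-8859-1"), "UTF-8");
  // (the UnsupportedEncodingException these calls can throw is an IOException,
  //  so doPost()'s usual "throws IOException" already covers it)

That re-decoding trick only works because ISO-8859-1 maps every byte to a character, so the
original bytes can be recovered; forcing the encoding up front is the cleaner approach.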
(2) the "accept-charset" attribute of a <form> does not mean that this <form> will *send*
data according to that charset/encoding. It indicates that any data that is entered in
the form's input boxes will be interpreted as being in that charset.
So the fact of adding an "accept-charset" attribute to your <form> tags does not make it
so that the browser will magically change its behaviour when POSTing data.
In other words, it's a mess, and the mess is mainly due to some lack of precision in the
original RFCs, but it is being perpetuated now by browser developers' fear of breaking
server applications by doing things right.
Which is rather funny in a way, considering all the things that browser developers do all
the time anyway which do break existing applications.
We really need an RFC for HTTP 2.0, with UTF-8 as the default charset/encoding.