With respect to the encoding of a form, several concepts are being mixed
together.
Here is my brief take on it.

First, if nothing else is done, the form data will be sent in the encoding
of the containing web page.
It is true that the HTTP request to the server will not contain a charset
parameter in its Content-Type header.

However, the HTML FORM element allows the page author to specify the
accept-charset attribute to identify the encodings the server will
understand. The recommended usage is to specify a single encoding.
The browser then will/should put the data in that encoding for the server to
use.
It is not on the HTTP request by default; you have to specify it in the
form.

You might do this to avoid problems where the user changes the encoding they
use to view the page.
(The data is still sent in the accept-charset encoding.)

I believe there is general agreement among i18n folks that this attribute
would have been better named as charset with only a single argument.
Hindsight...

Now the comment about URLs is also true. The path after the domain name up
to the query should be UTF-8. The remaining query portion can be in any
encoding, and converting it to UTF-8 would have broken existing CGI and other
programs that parse the query. So it is only percent-encoded (%HH hex
escapes), using the current encoding. However, I believe the current
encoding would be the single encoding of the form, if accept-charset was
used in the form.
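
To illustrate the point about the query portion, here is a minimal Python
sketch (not the browser's actual code) of how the same text yields different
percent-encoded bytes depending on the encoding in effect. The parameter
name "q" and the sample text are assumptions chosen only for illustration.

```python
from urllib.parse import urlencode

text = "café"

# The same form value, percent-encoded under two different "current" encodings.
utf8_query = urlencode({"q": text}, encoding="utf-8")
cp1252_query = urlencode({"q": text}, encoding="windows-1252")

print(utf8_query)    # q=caf%C3%A9
print(cp1252_query)  # q=caf%E9
```

A program parsing the query sees only the escaped bytes, so it has to know
which encoding was used in order to decode them correctly.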

Finally, we have to consider IE. For some reason, and despite their
documents saying they do send accept-charset, IE NEVER sends accept-charset.
So if you test with IE you are misled.

I wrote a little test PHP program that has 2 identical forms. You enter text
in either form and it posts and displays the hex codes for the bytes. The
first form does not set accept-charset, so it defaults to the page encoding,
utf-8.
The second form overrides the page encoding and sets accept-charset to
windows-1252.
http://www.xencraft.com/php/testforms.php

If you use IE, since it never honors accept-charset, both forms behave the
same and display utf-8 byte values. On Netscape, the second form converts
the characters to windows-1252, so the characters look munged, and the
hex codes show windows-1252 values.
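
As a rough stand-in for what the test page displays (this is a Python sketch,
not the PHP program itself), the following shows the hex byte values the two
forms would produce for the same input. The helper name hex_bytes and the
sample text are made up for illustration.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

sample = "é€"

utf8_hex = hex_bytes(sample, "utf-8")           # first form (page encoding)
cp1252_hex = hex_bytes(sample, "windows-1252")  # second form (accept-charset)

print(utf8_hex)    # C3 A9 E2 82 AC
print(cp1252_hex)  # E9 80

# Why the second form looks munged: windows-1252 bytes shown as if utf-8.
munged = "é".encode("windows-1252").decode("utf-8", errors="replace")
```

The last line decodes a windows-1252 byte as utf-8, which produces a
replacement character instead of the original letter.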

In summary, if the accept-charset is supplied on a request, and it contains
a single encoding, we should use it as the encoding of the form.
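
A minimal sketch of that rule, in Python rather than in PHP internals. The
function names, the fallback to a configured page encoding, and the way the
accept-charset value reaches the server are all assumptions made for
illustration, not part of any existing API.

```python
from typing import Optional

def form_encoding(accept_charset: Optional[str],
                  page_encoding: str = "utf-8") -> str:
    """Pick the encoding to decode form data with (hypothetical helper)."""
    if not accept_charset:
        return page_encoding
    # HTML 4 allows a space- and/or comma-separated list of encodings.
    names = accept_charset.replace(",", " ").split()
    if len(names) == 1:
        return names[0]
    # Several candidates: we cannot tell which one was used, so fall back.
    return page_encoding

def decode_field(raw: bytes, accept_charset: Optional[str],
                 page_encoding: str = "utf-8") -> str:
    """Decode one raw form field using the encoding chosen above."""
    return raw.decode(form_encoding(accept_charset, page_encoding))

print(decode_field(b"caf\xe9", "windows-1252"))  # café
```

With more than one encoding listed, this sketch deliberately falls back to
the configured default rather than guessing.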

On IE you will never see it. But other browsers that are standards-compliant
(Netscape was very strong in this area) will.

I only tried POST. Somebody else might try it with GET.


Tex Texin
Internationalization Architect,   Yahoo! Inc.
 
 


> -----Original Message-----
> From: Andrei Zmievski [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, August 24, 2005 4:23 PM
> To: Makoto Tozawa
> Cc: [EMAIL PROTECTED]; PHP Developers Mailing List
> Subject: Re: [PHP-DEV] Re: PHP Unicode support design document
> 
> 
> Hi,
> 
> On Aug 23, 2005, at 7:30 PM, Makoto Tozawa wrote:
> 
> > "HTTP Input Encoding
> > ...
> > If the HTTP request contains the encoding specification in the 
> > headers, then it will be used instead of this setting."
> >
> > With my best knowledge there isn't such http request header which
> > specifies the encoding of the request. In case the intent is to honor
> > the ACCEPT-CHARSET, it may cause a problem because browsers don't
> > guarantee the encoding in the ACCEPT-CHARSET is same as the encoding
> > used to escape characters in the URL query string. After all, the
> > ACCEPT-CHARSET is to specify the character encodings acceptable for
> > the response.
> 
> I took a closer look at this today and RFC 2616 does not specify 
> whether user agents are supposed to send a charset parameter in the 
> Content-Type header of the POST request. I did not see any of my 
> browsers doing so. I think we can safely disregard this and rely on 
> http_input_encoding and output_encoding settings. We are not going to 
> use Accept-Charset for the reasons you mention.
> 
> > Is there any way to keep the byte semantics (in oppose to unicode
> > semantics) only for the existing functions? For example, the Oracle 8
> > functions can be configured to use utf-8 for the character encoding of
> > strings. In order for them to work properly, fundamental functions,
> > which Oracle 8 function call, have to behave in byte semantics. And if
> > they work properly when the unicode semantics switch is turned on, by
> > setting the runtime_encoding to utf-8, they can be called by unicode
> > applications.
> 
> I couldn't parse this on the first try. Could you restate this?
> 
> -Andrei
> 
> -- 
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php
> 
> 
> 
