Quoting Tim Bannister <[EMAIL PROTECTED]>:

On Fri, Dec 07, 2007 at 12:00:04PM +0000, [EMAIL PROTECTED] wrote:

> Also, why does Horde/Imp require the user to select the encoding in the
> first place? Desktop clients usually select the most appropriate
> encoding automatically. Kmail, for instance, usually uses either
> US-ASCII or ISO-8859-1. But if I type some Japanese characters into an
> e-mail, it automatically switches to ISO-2022-JP (the most common
> encoding for Japanese e-mail). I'm sending this message from Kmail as
> UTF-8.

PHP doesn't support automatic detection of charsets, and it's even
more complicated to detect it client-side, i.e. when typing the
message. We already do choose the most appropriate encoding though,
because we choose the encoding that matches the currently selected
interface language.

However, data outside ISO-8859-1 are usually sent as SGML entities.
That's enough to infer an encoding (UCS-2). If there aren't any entities
and no encoding was specified then it seems reasonable for Horde to
infer ISO-8859-1.

I don't think this is right any more. The form is submitted as multipart/form-data (RFC 2388), but IMP (PHP) often can't tell how the parts are encoded. I'll attach a couple of sample submissions. The two attachments were generated with WebKit (Safari 3.0.4) by varying the Content-Type header sent by compose.php.

In my tests with Safari, Firefox and Internet Explorer, I found that the character encoding is not indicated on submission but is consistently derived from the encoding of the document in which the form appears. That document encoding varies depending on what the user agent asks for (for example, I get UTF-8), but the key point is that IMP knows which encoding it sent. The thing about entities is a bit of a red herring; it's how IE submits characters it can't directly encode, and other browsers have copied this. The entities get decoded before being used in a message body.
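
To illustrate the decoding step described above, here is a rough sketch in Python (not IMP's PHP, and not IMP's actual code). It assumes the server already knows the charset of the page that carried the form. The function name decode_submitted_field and its parameters are hypothetical. It decodes the raw bytes with that charset and then expands decimal numeric character references of the kind browsers substitute for characters they can't encode.

```python
import re

# Matches decimal numeric character references such as "&#26085;".
_NCR = re.compile(r"&#(\d+);")


def decode_submitted_field(raw: bytes, page_charset: str) -> str:
    """Decode a form field using the charset of the page that held the form,
    then expand any numeric character references found in the text."""
    text = raw.decode(page_charset)

    def expand(match):
        code_point = int(match.group(1))
        # Leave references beyond the Unicode range untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _NCR.sub(expand, text)


# An ISO-8859-1 page: the pound sign is sent as one byte, while the
# Japanese text arrives as numeric character references.
body = decode_submitted_field(b"\xa3 and &#26085;&#26412;&#35486;", "iso-8859-1")
```

One caveat with this approach: a reference the user typed literally, such as the "&#38;" in the PPS below, looks the same on the wire as one substituted by the browser, so this sketch would turn it into a bare ampersand.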

If IMP sets "charset" to the same value set in the HTTP headers for the form, that encoding will be used for the submitted data by the three popular browsers. It seems accurate enough that it could perhaps become a hidden input.
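
The hidden-input idea could look something like the following Python sketch. Again this is only an illustration under assumptions, not IMP code: the helper names render_compose_form and decode_fields, the field name "message", and the ISO-8859-1 fallback are all hypothetical. The form repeats the charset the page is served with in a hidden field, and the handler uses that value to decode the other fields.

```python
import codecs
import re
from html import escape


def render_compose_form(charset: str) -> str:
    """Return a form whose hidden "charset" field repeats the charset
    that the page itself is served with."""
    return (
        '<form method="post" enctype="multipart/form-data">\n'
        f'<input type="hidden" name="charset" value="{escape(charset, quote=True)}">\n'
        '<textarea name="message"></textarea>\n'
        '</form>'
    )


def decode_fields(fields: dict, fallback: str = "iso-8859-1") -> dict:
    """Decode raw field bytes using the charset named in the hidden field,
    falling back when the name is missing or unknown."""
    declared = fields.get("charset", b"").decode("ascii", errors="replace")
    charset = fallback
    # Only look up names that resemble charset labels.
    if re.fullmatch(r"[A-Za-z0-9._:-]+", declared):
        try:
            codecs.lookup(declared)
            charset = declared
        except LookupError:
            pass
    return {
        name: value.decode(charset, errors="replace")
        for name, value in fields.items()
        if name != "charset"
    }


submitted = {"charset": b"utf-8", "message": "€ русский".encode("utf-8")}
decoded = decode_fields(submitted)
```

Validating the declared name before using it matters because the hidden field comes back from the client and can't be trusted to hold a real charset label.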

PS. There's some background information in Mozilla bugs 18643 and 228779:
https://bugzilla.mozilla.org/show_bug.cgi?id=18643
https://bugzilla.mozilla.org/show_bug.cgi?id=228779

PPS. Some internationalised text: €, £, $, русский, 日本語。Also, the literal characters "ampersand hash three eight semicolon": &#38;

--
Tim Bannister
IT Services

e: [EMAIL PROTECTED]
w: http://www.manchester.ac.uk/itservices

Attachment: form-data.iso-8859-1
Description: Binary data

Attachment: form-data.utf-8
Description: Binary data

