> On 25 Nov 2014, at 11:20, Alain Williams <a...@phcomp.co.uk> wrote:
> 
> I think that we need to clarify what we are talking about.
> 
> What Andrea has proposed is a way of writing string constants. The
> characters in these strings will still be 8 bits wide, which means that there
> needs to be some way of encoding characters whose code points will not fit in
> 8 bits. The only way of avoiding that would be to use 32-bit characters
> internally, which would be a huge change.
> 
> So: we need to have some form of encoding.
> 
> As I stated: ''a way of writing string constants'', i.e. a *compile*-time
> action.
> 
> With the code below it is likely that at *run-time* mb_internal_encoding() has
> been called before the echo is executed, or that the 'Content-Type:' header
> specifies some encoding.
> 
>> // won't output anything useful if script encoding is not UTF-8
>> echo "mañana \u{1F602}";
> 
> This is not something that the compiler can guess.

Well, we *do* already have a compile-time system for declaring encoding: the
declare() construct.

> It is even worse if my proposal of \U{arabic letter alef}-style escapes is
> added: how is that encoded? UTF-8 or ISO-8859-6 or ...?
> 
> So, how do we fix the problem?
> 
> * mb_internal_encoding($new_encoding) finds every string (variable and
>  constant) and converts it from the previous encoding to $new_encoding.
> 
>  Possible, but horribly slow, and it would probably break things (e.g.
>  strings that contain binary data).
> 
>  Not a good idea.

I also agree this isn’t a good idea.

> * Decide that UTF-8 is king.
>  That is what I have decided, but I do not have any legacy code to worry
>  about, and being a Brit I don't have to worry much.
> 
> * Rely on the programmer to understand encoding and to know what the eventual
>  output encoding will be and, if it is not UTF-8, to write characters using
>  \xNN escapes or use mb_convert_encoding($string, $output_encoding, 'utf-8').
> 
> If we decide to support non-UTF-8 encodings at compile time then we could
> extend the syntax a bit to allow the encoding to be specified, e.g.:
> 
>    \U{utf-8: arabic letter alef}
> 
>    \U{iso-8859-6: arabic letter alef}
> 
> I.e., allow the encoding to be optionally specified, terminated by ':'. If it
> is not specified, assume UTF-8.

There are only two sane options:

  * Always UTF-8
  * Whatever source file encoding we’ve specified with declare()

Of those, I’d prefer UTF-8, as nobody’s using UTF-16 or UTF-32.
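
To make the difference between the two options concrete, here is a rough
sketch of what "encode this code point" would mean under each. It is only an
illustration, not how the escape would actually be implemented:
encode_code_point() is a hypothetical helper I made up for this mail, and it
assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not part of PHP): encode a single Unicode code point
// into the given target encoding, defaulting to UTF-8. Requires mbstring.
function encode_code_point($codePoint, $encoding = 'UTF-8')
{
    // pack('N') writes the code point as 4 big-endian bytes, i.e. UTF-32BE,
    // which mbstring can then convert to the requested encoding.
    return mb_convert_encoding(pack('N', $codePoint), $encoding, 'UTF-32BE');
}

$alef = 0x0627; // ARABIC LETTER ALEF

// Option 1: always UTF-8 (two bytes for this character).
echo bin2hex(encode_code_point($alef)), "\n";               // d8a7

// Option 2: whatever encoding the source file declares, here ISO-8859-6
// (a single byte for the same character).
echo bin2hex(encode_code_point($alef, 'ISO-8859-6')), "\n"; // c7

// A code point outside the BMP still fits in 8-bit strings as UTF-8.
echo bin2hex(encode_code_point(0x1F602)), "\n";             // f09f9882
```

The same escape therefore yields different bytes depending on which option we
pick, which is why the choice has to be fixed at compile time.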

--
Andrea Faulds
http://ajf.me/

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
