> On 25 Nov 2014, at 11:20, Alain Williams <a...@phcomp.co.uk> wrote:
>
> I think that we need to clarify what we are talking about.
>
> What Andrea has proposed is a way of writing string constants. These characters
> in these strings will still be 8 bits big, this means that there needs to be
> some way of encoding characters with code points that will not fit in 8 bits.
> The only way of avoiding that would be to use, internally, 32 bit characters --
> which would be a huge change.
>
> So: we need to have some form of encoding.
>
> As I started ''a way of writing string constants'' - ie a *compile* time action.
>
> With the code below it is likely that at *run-time* mb_internal_encoding() has
> been called before the echo is executed or the 'Content-Type:' header specifies
> some encoding.
>
>> echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8
>
> This is not something that the compiler can guess.
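
To make the quoted concern concrete, here is a minimal sketch; it is an illustration, not part of the proposal text. It assumes a PHP version in which the \u{...} escape exists (PHP 7.0 or later) and always produces UTF-8 bytes, and it assumes the mbstring extension is loaded. The variable names are hypothetical.

```php
<?php
// Assumption: PHP 7.0+ (for the \u{...} escape) with the mbstring extension.

// U+00F1 (the "ñ" in "mañana"), written as a compile-time escape.
// The resulting string holds the UTF-8 bytes for that code point.
$escaped = "\u{F1}";

// If the page is actually served as ISO-8859-1, the conversion has to be
// done explicitly at run time; the compiler cannot know the output encoding.
$latin1 = mb_convert_encoding($escaped, 'ISO-8859-1', 'UTF-8');

echo bin2hex($escaped), "\n"; // c3b1 (two bytes, UTF-8)
echo bin2hex($latin1), "\n";  // f1   (one byte, ISO-8859-1)
```

The escape is resolved when the script is compiled, while the conversion is a separate run-time step that depends on whatever output encoding the programmer has chosen.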

Well, we *do* already have a compile-time system for declaring encoding, the declare() construct.

> It is even worse if my proposal of \U{arabic letter alef} types is added, how is
> that encoded ? UTF-8 or iso-8859-6 or .... ?
>
> So, how do we fix the problem ?
>
> * mb_internal_encoding($new_encoding) finds every string (variable and constant)
>   and converts from the previous encoding to the $new_encoding.
>
>   Possible, but horribly slow and would prob break things (eg strings that
>   contain binary data).
>
>   Not a good idea.

I also agree this isn’t a good idea.

> * Decide that UTF-8 is king.
>   That is what I have decided - but I do not have any legacy code to worry about
>   -- being a Brit I don't have to worry much.
>
> * Rely on the programmer to understand encoding and know what the eventual
>   output encoding will be and if it is not UTF-8 write characters using \Xxx or
>   use mb_convert_encoding($string, $output_encoding, 'utf-8').
>
> If we decide to support non-utf-8 encoding at compile time then we could extend
> the syntax a bit to allow the encoding to be specified, eg:
>
>     \U{utf-8: arabic letter alef}
>
>     \U{iso-8859-6: arabic letter alef}
>
> Ie, allow this to be optionally specified and terminated by ':'. If not
> specified then assume utf-8.

There are only two sane options:

* Always UTF-8
* Whatever source file encoding we’ve specified with declare()

Of those, I’d prefer UTF-8, as nobody’s using UTF-16 or UTF-32.

--
Andrea Faulds
http://ajf.me/

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php