Hi!

> Just a quick point: most of the core is not ASCII. PHP strings are byte 
> strings, completely divorced from any encoding. A few native functions 
> assume ISO-8859-1 (or possibly Windows CP1252), but mostly they just 
> juggle whichever bytes you give them.

True, but not all extensions and functions behave this way. Some
(especially in intl, but not only there) assume UTF-8, for example, and
for others UTF-8 is a changeable default, which in practice often
becomes the encoding actually used, since people are not aware of the
need to track their encoding and most of them use UTF-8 anyway.
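
To make the difference concrete, here is a minimal sketch, assuming the
mbstring extension is loaded. The sample string is my own illustration;
strlen() just counts bytes, while mb_strlen() interprets the same bytes
under whatever encoding it is given:

```php
<?php
// "\xC3\xA9" is the UTF-8 encoding of U+00E9 (e with acute accent),
// written as escapes so the example does not depend on this file's encoding.
$s = "caf\xC3\xA9";

// strlen() counts bytes and knows nothing about encodings.
$bytes = strlen($s);              // 5

// mb_strlen() decodes the bytes using the encoding passed to it.
$chars = mb_strlen($s, 'UTF-8');  // 4

// Without an explicit encoding, mbstring uses a configurable default.
$default = mb_internal_encoding();

echo "$bytes bytes, $chars characters (mbstring default: $default)\n";
```

If the second argument is dropped, the result depends on that default,
which is exactly the kind of untracked encoding I mean.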

> The main exception I can think of is that numbers are often handled 
> specially, with digits and separators as defined by ASCII. But since 
> we're talking UTF-8, that doesn't need to change.

A more interesting case, actually, is, well, case conversion. We
unknowingly used locale-dependent lowercasing routines until the
inevitable encounter with the dreaded Turkish 'i'. At that point we
switched to forced ASCII lowercasing. So identifiers in the engine are
more or less assumed to be ASCII, even though you can sometimes sneak
non-ASCII past it and it will work, but weirdly.
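
Roughly what "forced ASCII" means, as a userland sketch. The helper name
ascii_lowercase() is made up for illustration and this is not the
engine's actual code; the point is only that the bytes A-Z are mapped
and everything else is left alone, regardless of locale:

```php
<?php
// Hypothetical helper (not the engine's implementation): lowercase only
// the bytes A-Z, byte by byte, without consulting the current locale.
function ascii_lowercase(string $s): string
{
    return strtr(
        $s,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}

// An ASCII identifier always folds the same way, so 'I' becomes 'i'
// even when a Turkish locale is active.
echo ascii_lowercase('MyClass::INIT'), "\n";  // myclass::init

// "\xC4\xB0" is UTF-8 for U+0130 (capital I with dot above). Its bytes
// are outside A-Z, so they pass through untouched; only 'D' is folded.
$name = "\xC4\xB0D";
var_dump(ascii_lowercase($name) === "\xC4\xB0d");  // bool(true)
```

The second case is the "works, but weirdly" part: the non-ASCII bytes
survive, but they are never case-folded, so two spellings that differ
only in the case of a non-ASCII letter are treated as different names.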

-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
