I think PHP should be consistent in using a given Unicode version in each
release.

Improvements in PHP 7.0, especially IntlChar, allow us to do properly
various things that required hackery with preg and mbstring in the past. But
both approaches will have to coexist for the foreseeable future. So I think
it's desirable to have all three exts based on the same version of Unicode.
Otherwise inconsistencies can arise: for example, one API says a code point
is unassigned and another says it is a lowercase letter. (Given the
importance of validating strings in many PHP apps, this is relevant.)
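
To make that kind of disagreement concrete, here is a rough sketch that asks
each extension about the same code point. The function name
classify_code_point() is hypothetical. mbstring has no direct property
query, so the sketch assumes that "mb_strtoupper() changes the character" is
an acceptable proxy for "mbstring's tables treat it as a lowercase letter".

```php
<?php
// Hypothetical helper: report how intl, pcre and mbstring each treat one
// code point, so that disagreements between their Unicode tables show up.
function classify_code_point(int $cp): array
{
    $ch = IntlChar::chr($cp); // UTF-8 encoded string for the code point

    return [
        // intl (ICU): is the code point assigned, and is it category Ll?
        'intl_assigned'  => IntlChar::isdefined($cp),
        'intl_lowercase' => IntlChar::charType($cp)
            === IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER,

        // pcre: Unicode property escapes, which need the /u modifier.
        // \P{Cn} matches anything that is NOT "unassigned".
        'pcre_assigned'  => preg_match('/^\P{Cn}$/u', $ch) === 1,
        'pcre_lowercase' => preg_match('/^\p{Ll}$/u', $ch) === 1,

        // mbstring: proxy only (assumption). True when the extension's
        // case-mapping tables know an uppercase form of the character.
        'mb_has_uppercase' => mb_strtoupper($ch, 'UTF-8') !== $ch,
    ];
}

// Example: print the answers for "a" (U+0061).
var_dump(classify_code_point(0x61));
```

Running it on a code point first assigned in Unicode 8.0.0, such as U+AB70
(Cherokee small letter A), is one way to see whether the three extensions in
a given build agree.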

So it would be good if a UCD upgrade in any given PHP point release would
apply to all three: mbstring, intl and pcre.

The status for 7.0 doesn't look too bad:
1. A recent commit updating ext/mbstring/unicode_data.h to Unicode 8.0.0
appears to be in 7.0.0RC4.
2. I think intl is using ICU4C 55.1, which uses Unicode 7.0. ICU 56RC,
implementing Unicode 8.0.0, is available, but it seems unlikely that 56 will
be ready in time for PHP 7.0.
3. The version of PCRE in ext/pcre/pcrelib uses Unicode 7.0 and, as I
pointed out last week (seemingly to nobody's interest), is probably never
going to be upgraded.
4. Tables in ext/standard/html_tables are based on Unicode 3.0 but I doubt
they have ever been affected by a Unicode upgrade.
5. I'm not sure where else the UCD is used in PHP and would be interested to
find out.
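
As a starting point for checking a given build, this sketch collects the
version information that the extensions expose at run time. The function
name unicode_version_report() is hypothetical. It assumes that mbstring does
not expose the Unicode version of its tables, so that one still has to be
read from ext/mbstring/unicode_data.h in the source tree.

```php
<?php
// Hypothetical helper: gather the version strings each extension exposes.
function unicode_version_report(): array
{
    $report = [];

    if (class_exists('IntlChar')) {
        // getUnicodeVersion() returns an array of integers, e.g. [7, 0, 0, 0].
        $report['intl Unicode version'] = implode('.', IntlChar::getUnicodeVersion());
    }
    if (defined('INTL_ICU_VERSION')) {
        $report['ICU version'] = INTL_ICU_VERSION;
    }
    if (defined('PCRE_VERSION')) {
        // The PCRE library version; its Unicode version has to be looked up
        // in that release's documentation.
        $report['PCRE version'] = PCRE_VERSION;
    }

    return $report;
}

foreach (unicode_version_report() as $label => $version) {
    echo $label, ': ', $version, PHP_EOL;
}
```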

#70475 caused mbstring/unicode_data.h to be regenerated from Unicode 8.0.0.
But it is still open and I commented that if it were regenerated using
Unicode 7.0.0 instead then PHP 7.0 could be consistent. PHP 7.0 using
Unicode 7.0.0 throughout is perfectly reasonable, imo.

Do people here agree that PHP should have a *policy* of using a consistent
Unicode version?

This appears to be easy to accomplish for the moment. Moving to Unicode 8
will be harder.

Tom


From:  Tom Worster <f...@thefsb.org>
Date:  Thursday, September 24, 2015 at 9:40 AM
To:  php-internals <internals@lists.php.net>
Subject:  Unicode regex roadmap

While PCRE2 upgraded to Unicode version 8, PCRE, which is in maintenance
mode, will presumably remain on Unicode version 7 indefinitely.

Does PHP have a roadmap for up-to-date regex, either with PCRE2 or some
other lib?

Tom

