I think PHP should be consistent in using a given Unicode version in each release.
Improvements in PHP 7.0, especially IntlChar, allow us to do various things properly that required hackery with preg and mbstring in the past. But both approaches will have to coexist for the foreseeable future, so I think it's desirable to have all three exts based on the same version of Unicode. Otherwise weirdness can arise: for example, one API says a code point is unassigned and another says it is a lowercase letter. (Given the importance of validating strings in many PHP apps, this is relevant.) So it would be good if a UCD upgrade in any given PHP point release applied to all three: mbstring, intl and pcre.

The status for 7.0 doesn't look too bad:

1. A recent commit updating ext/mbstring/unicode_data.h to Unicode 8.0.0 appears to be in 7.0.0RC4.

2. I think intl is using ICU4C 55.1, which uses Unicode 7.0. ICU 56RC, implementing Unicode 8.0.0, is available, but it seems unlikely 56 will be ready in time for PHP 7.0.

3. The version of PCRE in ext/pcre/pcrelib uses Unicode 7.0 and, as I pointed out last week (seemingly to nobody's interest), is probably never going to upgrade.

4. Tables in ext/standard/html_tables are based on Unicode 3.0, but I doubt they have ever been affected by a Unicode upgrade.

5. I'm not sure where else the UCD is used in PHP and would be interested to find out.

#70475 caused mbstring/unicode_data.h to be regenerated from Unicode 8.0.0. But it is still open, and I commented that if it were regenerated using Unicode 7.0.0 instead, then PHP 7.0 could be consistent. PHP 7.0 using Unicode 7.0.0 throughout is perfectly reasonable, imo.

Do people here agree that PHP should have a *policy* of using a consistent Unicode version? This appears to be easy to accomplish for the moment. Moving to Unicode 8 will be harder.
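To make the "one API says unassigned, another says lowercase letter" scenario concrete, here is a rough sketch of how one might probe the three extensions from userland. The helper names (utf8_chr, probe_code_point, reported_versions) are made up for this illustration. The mbstring check is only a proxy, since mbstring has no call I know of that reports its Unicode version: it treats a changed case mapping as evidence that the code point is in its tables. U+AB70 (CHEROKEE SMALL LETTER A) is used as the example because it was added in Unicode 8.0.

```php
<?php
// Encode a Unicode scalar value as UTF-8 without relying on intl or mbstring.
function utf8_chr(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Ask each extension whether it "knows" the code point.
// A null entry means the extension is not loaded.
function probe_code_point(int $cp): array
{
    $ch = utf8_chr($cp);
    return [
        // intl: is the code point assigned in ICU's Unicode data?
        'intl' => class_exists('IntlChar') ? IntlChar::isdefined($cp) : null,
        // pcre: does it match "anything except Unassigned (Cn)"?
        'pcre' => function_exists('preg_match')
            ? preg_match('/^\P{Cn}$/u', $ch) === 1
            : null,
        // mbstring (proxy only): does upper-casing change the string?
        'mbstring' => function_exists('mb_strtoupper')
            ? mb_strtoupper($ch, 'UTF-8') !== $ch
            : null,
    ];
}

// Version information that the extensions do report directly.
function reported_versions(): array
{
    return [
        'icu' => defined('INTL_ICU_VERSION') ? INTL_ICU_VERSION : null,
        'unicode_intl' => class_exists('IntlChar')
            ? implode('.', IntlChar::getUnicodeVersion())
            : null,
        'pcre' => defined('PCRE_VERSION') ? PCRE_VERSION : null,
    ];
}

var_dump(reported_versions());
var_dump(probe_code_point(0xAB70)); // added in Unicode 8.0
```

If the three entries returned by probe_code_point() differ for the same code point, the extensions are presumably built on different Unicode versions, which is exactly the inconsistency I would like a policy to rule out.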
Tom

From: Tom Worster <f...@thefsb.org>
Date: Thursday, September 24, 2015 at 9:40 AM
To: php-internals <internals@lists.php.net>
Subject: Unicode regex roadmap

While PCRE2 upgraded to Unicode version 8, PCRE, which is in maintenance mode, will presumably remain on Unicode version 7 indefinitely.

Does PHP have a roadmap for up-to-date regex, either with PCRE2 or some other lib?

Tom