Re: [PHP-DEV] [RFC] Unicode Text Processing

Andreas Heigl Thu, 15 Dec 2022 08:06:24 -0800

Hey Derick, Hey all.

On 15.12.22 16:34, Derick Rethans wrote:

Hi,


I have just published an initial draft of the "Unicode Text Processing"
RFC, a proposal to have performant Unicode text processing always
available to PHP users, by introducing a new "Text" class.

You can find it at:
https://wiki.php.net/rfc/unicode_text_processing

I'm looking forward to hearing your opinions, additions, and
suggestions; the RFC specifically asks for these in places.


Thanks for tackling this immense topic.

I see a few challenges in the approach. My first question was: why do we need a new implementation of the ICU library? A userland implementation that wraps the existing mbstring and ICU functions in a class with better usability shouldn't add much of a performance penalty. And including the mbstring and intl extensions by default wouldn't hurt.


That way there would be no added maintenance burden on the core developers.

In addition, it looks to me as if several things are mixed up in this Text class. If we want a Text class that handles Unicode strings in a better way, why does the string itself need to be locale-aware? The string is a collection of Unicode code points that represent characters and graphemes. Does it need to know a locale just to aid in sorting? It certainly needs to know its internal normalization form for character comparison. But I would rather see a Normalizer handle normalization of the Text content than have the Text class do that itself. Similarly, I'd like to see transliteration done by a separate class. That then looks very similar to the intl extension, which brings me back to the question: do we really need a second intl extension in the core?
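
A minimal sketch of that separation, written in Python rather than PHP purely for brevity and using only the standard `unicodedata` module. The `Text` and `Normalizer` classes below are hypothetical illustrations, not the API proposed in the RFC and not the intl extension's classes:

```python
import unicodedata


class Normalizer:
    """Hypothetical helper that owns the choice of normalization form."""

    def __init__(self, form: str = "NFC") -> None:
        self.form = form

    def normalize(self, value: str) -> str:
        return unicodedata.normalize(self.form, value)


class Text:
    """Hypothetical wrapper around a sequence of code points; it knows no locale."""

    def __init__(self, value: str) -> None:
        self.value = value

    def equals(self, other: "Text", normalizer: Normalizer) -> bool:
        # Comparison needs a common normalization form, but no locale.
        return normalizer.normalize(self.value) == normalizer.normalize(other.value)


composed = Text("\u00e9")     # "é" as a single code point
decomposed = Text("e\u0301")  # "e" followed by a combining acute accent
nfc = Normalizer("NFC")
```

Here `composed.equals(decomposed, nfc)` holds even though the two underlying code point sequences differ, and nothing in `Text` had to know about locales, collation, or transliteration.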

I'm ambivalent about this. On the one hand, it would certainly make some things easier. On the other hand, it adds a burden on the core developers that could be avoided by shipping the intl (and mbstring) extensions by default instead of requiring them to be added separately, and then finding a group of people willing to build a userland implementation.

And yes, I know the intl extension is anything but easy to use, especially in the quirky edge cases around transliteration and normalization. But the issue usually isn't using it; it is finding the appropriate documentation on the ICU site. Helping the ICU project improve that documentation would also be a huge benefit to everyone trying to use the intl extension right now.


But that's just my 0.02€

Cheers

Andreas


cheers,
Derick


--
                                                              ,,,
                                                             (o o)
+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl                                                       |
| mailto:andr...@heigl.org                  N 50°22'59.5" E 08°23'58" |
| https://andreas.heigl.org                                           |
+---------------------------------------------------------------------+
| https://hei.gl/appointmentwithandreas                               |
+---------------------------------------------------------------------+
| GPG-Key: https://hei.gl/keyandreasheiglorg                          |
+---------------------------------------------------------------------+
