On 16/12/22 02:34, Derick Rethans wrote:
Hi,

I have just published an initial draft of the "Unicode Text Processing"
RFC, a proposal to have performant Unicode text processing always
available to PHP users, by introducing a new "Text" class.

Using "collator" and "locale" interchangeably seems imprecise. If the input is an ICU locale string, then I think you should just call it locale. Then the user will be armed with the correct terminology when they go looking for more information in the ICU manual. In ICU, case conversion and BreakIterator need a locale, not a collator.

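I'm concerned about the time complexity of using grapheme offsets. For example, is subString() O(N) in $offset? If offsets count grapheme clusters, the implementation has to scan from the start of the string to find the $offset-th cluster boundary, unless it keeps an index of boundaries. If the idea is to be easy to use and performant, you don't want to have subtle algorithmic complexity traps.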
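
To illustrate the concern, here is a minimal sketch in Python rather than PHP, and not the RFC's implementation. It assumes a simplified notion of a grapheme cluster (a base character plus any following combining marks), which is far cruder than ICU's segmentation rules. The names cluster_starts, substring_by_scan and IndexedText are hypothetical. The first function rescans from the start on every call, so its cost grows with the offset; the class pays O(N) time and memory once to build a boundary index, after which finding a boundary is a list lookup.

```python
import unicodedata


def cluster_starts(text):
    """Yield the code point index at which each approximate cluster starts.

    Assumption: a cluster is one base character plus the combining marks
    that follow it. Real grapheme segmentation (UAX #29) has more rules.
    """
    for i, ch in enumerate(text):
        if i == 0 or unicodedata.combining(ch) == 0:
            yield i


def substring_by_scan(text, offset, length):
    """Substring by cluster offset, scanning from the start each time.

    Cost is O(offset + length): every call walks the string again.
    """
    begin = end = len(text)
    for n, start in enumerate(cluster_starts(text)):
        if n == offset:
            begin = start
        if n == offset + length:
            end = start
            break
    return text[begin:end]


class IndexedText:
    """Same operation, but with cluster boundaries computed once up front."""

    def __init__(self, text):
        self.text = text
        self.starts = list(cluster_starts(text))
        self.starts.append(len(text))  # sentinel: end of the last cluster

    def substring(self, offset, length):
        # Boundary lookup is a list index; only the copy depends on length.
        last = len(self.starts) - 1
        begin = self.starts[min(offset, last)]
        end = self.starts[min(offset + length, last)]
        return self.text[begin:end]


sample = "e\u0301a\u0300bc"  # four clusters stored as six code points
print(substring_by_scan(sample, 1, 2) == IndexedText(sample).substring(1, 2))
```

A loop that calls the scanning version with increasing offsets is quadratic in the length of the string, which is the kind of trap I mean. The index avoids that at the cost of extra memory per string.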

I'm probably not the target audience for this class, since I'm generally looking for maximum flexibility, not minimum complexity. As such, I'd like intl to have better documentation and more features. The RFC has a family of locale-aware case conversion functions which do not exist in intl. This was raised as an issue during the discussion on my ASCII case conversion RFC. It would be great if intl could get those functions too.

I think you should consider making this Text class a part of the intl extension. You're adding a class which is similar to the classes in that extension. In terms of data, it's like IntlChar, except it's for strings, not characters. Its constructor takes an ICU locale string, just like IntlBreakIterator or MessageFormatter.

I can understand if you don't want to follow all the existing conventions of the intl extension. But if that is the rationale for the RFC, I'd like to see a discussion of the specific usability problems with the intl extension.

-- Tim Starling

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
