On Thu, 15 Dec 2022, Rowan Tommins wrote:

> On 15/12/2022 15:34, Derick Rethans wrote:
> > I have just published an initial draft of the "Unicode Text Processing"
> > RFC, a proposal to have performant unicode text processing always
> > available to PHP users, by introducing a new "Text" class.
> >
> > You can find it at:
> > https://wiki.php.net/rfc/unicode_text_processing
> >
> > I'm looking forwards to hearing your opinions, additions, and
> > suggestions - the RFC specifically asks for these in places.
>
> As others have said already, thank you for taking a stab at this important
> topic. I agree that it would be a really useful feature for the language, but
> it's also a really difficult one to get right. Here are my initial thoughts...
>
> # Design Process
>
> Rather than designing the whole class "on paper", I think this really needs to
> be built as a prototype, where we can build up documentation and tests, plug
> variations into some real life scenarios, and have separate discussions about
> different details. If we limit ourselves initially to features already exposed
> by ext/intl (I think everything proposed so far is?), a prototype doesn't even
> need to be an extension, it can be in pure PHP. Then once the design is
> finalised, you have a ready-made polyfill for older PHP versions, and a set of
> tests for the native version :)

I do not want a polyfill. These already exist for intl and friends. I
had no intention of designing everything up front though, and it is likely
that I missed useful methods. This is not going to be right in a single
implementation.

> # UTF-8 on the outside, UTF-16 on the inside
>
> I know this will be a very common combination, but it feels odd that an
> application which actually wanted to work with UTF-16 would need to perform
> round-trips through UTF-8 just to use this class. It should at least be
> possible to specify the encoding on input and output.

I disagree. Users should not care what is used in the implementation.
It's only UTF-16 because that is what ICU's API uses. I do not want the
complexity of having different in/ex encodings. Perhaps 15 years ago that
was useful to have, but right now, everything should be UTF-8 on the
interface layer, that is, if you care about internationalisation.

> # Internationalisation
>
> Having locale and collation as state on the object, rather than
> parameters on relevant methods, feels like muddling responsibilities.
> It makes it hard to reason about what exactly some of the methods will
> do: Can I trust that this object will give me a sensible result from
> compareWith, or has it been assigned a collation somewhere else? What
> exactly will be the definition of "replace" or "contains" for this
> pair of objects?

A locale/collator is an inherent property of Text (we're dealing with
Text here, not strings). I do need to tidy up the wording about what
locales and collations are, as I've so far used them sparingly and
interchangeably.

> How users will work with these also needs careful thought - your first listed
> design goal is "keep it simple", but under locales and Internationalisation is
> the worrying sentence "This will require extensive documentation".

This phrase is meant to mean that the *format of the locale/collator
name* needs extensive documentation.

> One function that I would really like to see, for instance, is a
> grapheme-aware version of mb_strcut, to solve tasks like: "encode this
> abstract Unicode string as UTF-16BE, truncated to at most 200 bytes,
> without breaking apart any grapheme clusters".

For that to work, you need a method that instantly returns UTF-8 strings,
and not UTF-16. In the RFC, the current subString() uses int $length to
mean grapheme clusters. Adding another method to do something else is,
of course, possible. I'll think about it (and have noted it in "Open Issues").

cheers,
Derick

--
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php