Re: [PHP-DEV] [RFC] Unicode Text Processing

Derick Rethans Fri, 16 Dec 2022 05:56:00 -0800

On Thu, 15 Dec 2022, Rowan Tommins wrote:

> On 15/12/2022 15:34, Derick Rethans wrote:
> > I have just published an initial draft of the "Unicode Text Processing"
> > RFC, a proposal to have performant unicode text processing always
> > available to PHP users, by introducing a new "Text" class.
> > 
> > You can find it at:
> > https://wiki.php.net/rfc/unicode_text_processing
> > 
> > I'm looking forward to hearing your opinions, additions, and
> > suggestions — the RFC specifically asks for these in places.
> 
> 
> As others have said already, thank you for taking a stab at this important
> topic. I agree that it would be a really useful feature for the language, but
> it's also a really difficult one to get right. Here are my initial thoughts...
> 
> # Design Process
> 
> Rather than designing the whole class "on paper", I think this really needs to
> be built as a prototype, where we can build up documentation and tests, plug
> variations into some real life scenarios, and have separate discussions about
> different details. If we limit ourselves initially to features already exposed
> by ext/intl (I think everything proposed so far is?), a prototype doesn't even
> need to be an extension, it can be in pure PHP. Then once the design is
> finalised, you have a ready-made polyfill for older PHP versions, and a set of
> tests for the native version :)


I do not want a polyfill. These already exist for intl and friends. I 
had no intention to design everything up front though, and it is likely 
that I missed useful methods. This is not going to be right in a single 
implementation.

> # UTF-8 on the outside, UTF-16 on the inside
> 
> I know this will be a very common combination, but it feels odd that an
> application which actually wanted to work with UTF-16 would need to perform
> round-trips through UTF-8 just to use this class. It should at least be
> possible to specify the encoding on input and output.

I disagree. Users should not care what is used in the implementation.
It's only UTF-16 because that is what ICU's APIs use. I do not want the
complexity of having different in/ex encodings. Perhaps 15 years ago 
that was useful to have, but right now, everything should be UTF-8 on 
the interface layer, that is, if you care about internationalisation.
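
For an application that holds UTF-16 data, the round-trip Rowan describes
amounts to one conversion at each boundary. A minimal sketch, assuming
ext/mbstring is available; the two function names are hypothetical and are
not part of the RFC:

```php
<?php
// Hypothetical boundary helpers: UTF-8 is used at the interface layer,
// while the application itself stores UTF-16BE.

function utf16ToInterface(string $utf16be): string
{
    // Convert application data to the UTF-8 form the interface expects.
    return mb_convert_encoding($utf16be, 'UTF-8', 'UTF-16BE');
}

function interfaceToUtf16(string $utf8): string
{
    // Convert a UTF-8 result back to the application's UTF-16BE form.
    return mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
}

// "Hi" followed by U+1F600, written out as UTF-16BE bytes
// (the emoji is the surrogate pair D83D DE00).
$original = "\x00\x48\x00\x69\xD8\x3D\xDE\x00";

$utf8      = utf16ToInterface($original);
$roundTrip = interfaceToUtf16($utf8);
```

The conversion is lossless for well-formed input, so the cost is the extra
pass over the data rather than any change in content.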

> # Internationalisation
> 
> Having locale and collation as state on the object, rather than 
> parameters on relevant methods, feels like muddling responsibilities. 
> It makes it hard to reason about what exactly some of the methods will 
> do: Can I trust that this object will give me a sensible result from 
> compareWith, or has it been assigned a collation somewhere else? What 
> exactly will be the definition of "replace" or "contains" for this 
> pair of objects?

A locale/collator is an inherent property of Text (we're dealing with
Text here, not strings). I do need to tidy up the wording about what
locales and collations are, as I have so far used the two terms
interchangeably.
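
To illustrate why the collation matters for comparisons, here is a small
sketch using the existing Collator class from ext/intl. The helper name
compareIn() is hypothetical, and the proposed Text class is deliberately
not used, since its API is still under discussion:

```php
<?php
// Sketch: the same two strings compare differently depending on the
// locale whose collation rules are applied. Requires ext/intl.

function compareIn(string $locale, string $a, string $b): int
{
    $collator = new Collator($locale);
    $result = $collator->compare($a, $b);
    if ($result === false) {
        // Collator::compare() returns false on failure.
        throw new RuntimeException($collator->getErrorMessage());
    }
    return $result;
}

// Swedish treats "ä" as a separate letter sorted after "z";
// German sorts "ä" together with "a", so it comes before "z".
$swedish = compareIn('sv_SE', 'ä', 'z'); // expected positive with standard ICU data
$german  = compareIn('de_DE', 'ä', 'z'); // expected negative with standard ICU data
```

Whether that choice lives on the object or is passed to each method is the
design question raised above; the sketch only shows that the answer to
"which string sorts first" is not defined without it.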

> How users will work with these also needs careful thought - your first listed
> design goal is "keep it simple", but under locales and Internationalisation is
> the worrying sentence "This will require extensive documentation".

This phrase means that the *format of the locale/collator name* needs
extensive documentation.

> One function that I would really like to see, for instance, is a 
> grapheme-aware version of mb_strcut, to solve tasks like: "encode this 
> abstract Unicode string as UTF-16BE, truncated to at most 200 bytes, 
> without breaking apart any grapheme clusters".

For that to work, you need a method that directly returns UTF-8
strings, and not UTF-16. In the RFC, the current subString() uses int
$length to mean grapheme clusters. Adding another method to do
something else is of course possible. I'll think about it (and have
noted it in "Open Issues").

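The task Rowan describes can be approximated today with ext/intl and
ext/mbstring. The following is only a sketch under those assumptions; the
function name truncateGraphemesToBytes() is hypothetical and is not a
method proposed in the RFC:

```php
<?php
// Sketch: encode a UTF-8 string into a target encoding, keeping at most
// $maxBytes bytes and never splitting a grapheme cluster.
// Requires ext/intl (IntlBreakIterator) and ext/mbstring.

function truncateGraphemesToBytes(
    string $utf8,
    int $maxBytes,
    string $encoding = 'UTF-16BE'
): string {
    // A character break iterator walks grapheme cluster boundaries.
    $iterator = IntlBreakIterator::createCharacterInstance('root');
    $iterator->setText($utf8);

    $out = '';
    foreach ($iterator->getPartsIterator() as $cluster) {
        $encoded = mb_convert_encoding($cluster, $encoding, 'UTF-8');
        // Stop before the first cluster that would exceed the byte budget.
        if (strlen($out) + strlen($encoded) > $maxBytes) {
            break;
        }
        $out .= $encoded;
    }
    return $out;
}

// "e" plus U+0301 (combining acute) is one cluster of 4 bytes in UTF-16BE.
$fits    = truncateGraphemesToBytes("e\u{301}x", 4);
$tooLong = truncateGraphemesToBytes("e\u{301}x", 2);
```

With a budget of 2 bytes the whole cluster is dropped rather than cut
between the base letter and its combining mark, which is the behaviour
mb_strcut cannot guarantee on its own.
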
cheers,
Derick

-- 
https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug
