On 15/12/2022 15:34, Derick Rethans wrote:
> I have just published an initial draft of the "Unicode Text Processing"
> RFC, a proposal to have performant unicode text processing always
> available to PHP users, by introducing a new "Text" class.
>
> You can find it at:
> https://wiki.php.net/rfc/unicode_text_processing
>
> I'm looking forwards to hearing your opinions, additions, and
> suggestions — the RFC specifically asks for these in places.


As others have said already, thank you for taking a stab at this important topic. I agree that it would be a really useful feature for the language, but it's also a really difficult one to get right. Here are my initial thoughts...

# Design Process

Rather than designing the whole class "on paper", I think this really needs to be built as a prototype, where we can build up documentation and tests, plug variations into some real-life scenarios, and have separate discussions about different details. If we limit ourselves initially to features already exposed by ext/intl (which I think covers everything proposed so far), a prototype doesn't even need to be an extension; it can be written in pure PHP. Then once the design is finalised, you have a ready-made polyfill for older PHP versions, and a set of tests for the native version :)

We might also want to do some general investigation of what other languages and frameworks provide, and which decisions have proven good or bad in practice.

# Lossy Transforms

Automatic normalisation and stripping of BOMs seem useful, but they immediately rule out use of this class for anything where you want to get back exactly what you put in. For instance, if an ORM used Text instances for strings in data models, it would generate extra UPDATE queries on the database even when the string wasn't otherwise changed, because the normalised value would no longer match the stored bytes. I think it would be better to make this easy but explicit.
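
To make the ORM scenario concrete, here is a minimal sketch. It assumes ext/intl's Normalizer class is available, and that the proposed class would normalise to NFC on construction (my assumption for illustration; the variable names are made up):

```php
<?php
// Requires ext/intl for the Normalizer class.

// "é" stored in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
$stored = "Cafe\u{0301}";

// What a class that silently normalises to NFC would hold after construction.
$normalised = Normalizer::normalize($stored, Normalizer::FORM_C);

// The two strings render identically but are different byte sequences,
// so a naive byte-wise dirty check reports a change that the user never made.
$looksDirty = ($normalised !== $stored);

// The explicit alternative: keep the original bytes untouched, and normalise
// both sides only at the point where a comparison is actually wanted.
$sameText = Normalizer::normalize($stored, Normalizer::FORM_C)
    === Normalizer::normalize("Caf\u{00E9}", Normalizer::FORM_C);

var_dump(strlen($stored), strlen($normalised), $looksDirty, $sameText);
```

The point is only that the normalised value differs byte-for-byte from what came out of the database, even though no one edited the text.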

# UTF-8 on the outside, UTF-16 on the inside

I know this will be a very common combination, but it feels odd that an application which actually wanted to work with UTF-16 would need to perform round-trips through UTF-8 just to use this class. It should at least be possible to specify the encoding on input and output.

Ruby takes an interesting approach where strings are tagged with their current binary encoding, and only converted to another form if actually required. If your input layer says "$name = new Text($_GET['name'], 'Windows-1252');" and your output layer says "echo $name->asBytes('Windows-1252');" the overhead of converting to UTF-16 can be skipped entirely, unless something in between says "$name = $name->wordsToUpper()". This also removes another source of lossy transformation, since some encoding conversions aren't perfectly reversible (e.g. when the source encoding has more than one byte sequence mapped to the same Unicode code point).
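
A rough pure-PHP sketch of that tagging idea, assuming ext/mbstring and PHP 8.0+. LazyText, toUpper() and the $conversions counter are hypothetical names for illustration only, not anything from the RFC, and the sketch uses UTF-8 rather than UTF-16 as the working form simply because that is what mbstring makes convenient:

```php
<?php
// Hypothetical sketch: LazyText is an illustrative name, not part of the RFC.
// Requires ext/mbstring.
final class LazyText
{
    // Counts real transcoding steps; exists only to make the laziness visible.
    public int $conversions = 0;

    public function __construct(
        private string $bytes,
        private string $encoding
    ) {}

    // Return the bytes in the requested encoding, converting only when it
    // differs from the encoding the string is currently tagged with.
    public function asBytes(string $encoding): string
    {
        if (strcasecmp($encoding, $this->encoding) === 0) {
            return $this->bytes; // untouched round trip, no conversion
        }
        $this->conversions++;
        return mb_convert_encoding($this->bytes, $encoding, $this->encoding);
    }

    // An operation that needs Unicode semantics is what forces a conversion.
    public function toUpper(): self
    {
        $utf8 = $this->asBytes('UTF-8');
        $result = new self(mb_strtoupper($utf8, 'UTF-8'), 'UTF-8');
        $result->conversions = $this->conversions;
        return $result;
    }
}

$name = new LazyText("Zo\xEB", 'Windows-1252'); // "Zoë" as Windows-1252 bytes
$out = $name->asBytes('Windows-1252');          // same bytes back
$conversionsBefore = $name->conversions;        // still zero at this point
$upper = $name->toUpper()->asBytes('Windows-1252');
```

Only the toUpper() call pays for transcoding; the plain in-and-out path hands back exactly the bytes it was given.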

# Internationalisation

Having locale and collation as state on the object, rather than parameters on relevant methods, feels like muddling responsibilities. It makes it hard to reason about what exactly some of the methods will do: Can I trust that this object will give me a sensible result from compareWith, or has it been assigned a collation somewhere else? What exactly will be the definition of "replace" or "contains" for this pair of objects?

How users will work with these also needs careful thought: your first listed design goal is "keep it simple", but under locales and internationalisation is the worrying sentence "This will require extensive documentation". This is one of those places where "doing it right" is really hard to combine with "making it easy", because language is inherently complex, but users will expect a simple answer to "how do I make it case-insensitive?"

# Allowing other abstractions

I 100% approve of your use of grapheme clusters, rather than code points, as the primary unit; so many implementations get that wrong. However, when interacting with other systems, reasoning about bytes (or sometimes even code points) is essential.

One function that I would really like to see, for instance, is a grapheme-aware version of mb_strcut, to solve tasks like: "encode this abstract Unicode string as UTF-16BE, truncated to at most 200 bytes, without breaking apart any grapheme clusters".
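
As a sketch of what I mean, assuming ext/intl and ext/mbstring, and UTF-8 input. The function name is hypothetical, and a native version would presumably be a method on the proposed class rather than a free function:

```php
<?php
// Hypothetical helper: the name and signature are illustrative only.
// Requires ext/intl (grapheme_extract) and ext/mbstring (mb_convert_encoding).
function grapheme_cut_to_bytes(string $utf8, int $maxBytes, string $targetEncoding): string
{
    $out = '';
    $offset = 0;
    $length = strlen($utf8);

    while ($offset < $length) {
        // Pull out exactly one grapheme cluster starting at this byte offset.
        $cluster = grapheme_extract($utf8, 1, GRAPHEME_EXTR_COUNT, $offset);
        if ($cluster === false || $cluster === '') {
            break; // invalid input or nothing left to extract
        }

        // Measure the cluster in the target encoding, not in UTF-8.
        $encoded = mb_convert_encoding($cluster, $targetEncoding, 'UTF-8');
        if (strlen($out) + strlen($encoded) > $maxBytes) {
            break; // the whole cluster does not fit, so stop before it
        }

        $out .= $encoded;
        $offset += strlen($cluster);
    }

    return $out;
}

// "e" + U+0301 is one cluster (4 bytes in UTF-16BE), followed by "x" (2 bytes).
$text = "e\u{0301}x";
$fits = grapheme_cut_to_bytes($text, 5, 'UTF-16BE');   // keeps the cluster, drops "x"
$tooSmall = grapheme_cut_to_bytes($text, 3, 'UTF-16BE'); // cluster doesn't fit at all
```

With a 3-byte limit, a cut that only respects code point boundaries would keep the "e" and silently drop its accent; here the result is empty instead.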


Thanks again for getting the ball rolling, and I look forward to helping iterate the design.

Regards,

--
Rowan Tommins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php