Re: [PHP-DEV] [RFC] Unicode Text Processing

Rowan Tommins Thu, 15 Dec 2022 14:20:30 -0800

On 15/12/2022 15:34, Derick Rethans wrote:

> I have just published an initial draft of the "Unicode Text Processing"
> RFC, a proposal to have performant unicode text processing always
> available to PHP users, by introducing a new "Text" class.
>
> You can find it at:
> https://wiki.php.net/rfc/unicode_text_processing
>
> I'm looking forwards to hearing your opinions, additions, and
> suggestions — the RFC specifically asks for these in places.

As others have said already, thank you for taking a stab at this important topic. I agree that it would be a really useful feature for the language, but it's also a really difficult one to get right. Here are my initial thoughts...


# Design Process

Rather than designing the whole class "on paper", I think this really needs to be built as a prototype, where we can build up documentation and tests, plug variations into some real life scenarios, and have separate discussions about different details. If we limit ourselves initially to features already exposed by ext/intl (I think everything proposed so far is?), a prototype doesn't even need to be an extension, it can be in pure PHP. Then once the design is finalised, you have a ready-made polyfill for older PHP versions, and a set of tests for the native version :)
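
As a rough illustration of how small such a pure-PHP starting point could be, here is a minimal sketch on top of ext/intl's grapheme_* functions. The class name PrototypeText and its methods length() and slice() are hypothetical, chosen only for this example; they are not the API proposed in the RFC.

```php
<?php
// Requires ext/intl and PHP 8.0+. PrototypeText is a hypothetical name.
final class PrototypeText
{
    public function __construct(private string $utf8) {}

    // Length in grapheme clusters, not bytes or code points.
    public function length(): int
    {
        return grapheme_strlen($this->utf8);
    }

    // Sub-string addressed in grapheme clusters.
    public function slice(int $start, int $length): self
    {
        return new self(grapheme_substr($this->utf8, $start, $length));
    }

    public function __toString(): string
    {
        return $this->utf8;
    }
}

// "cafe" + U+0301 COMBINING ACUTE ACCENT + "!"
$text = new PrototypeText("cafe\u{0301}!");
echo $text->length(), "\n";        // 5 grapheme clusters
echo strlen((string) $text), "\n"; // 7 bytes in UTF-8
```

Each design question (normalisation, locale handling, encodings) could then be tried as a variation of a class like this, with tests that later carry over to a native implementation.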

We might also want to do some general investigation of what other languages and frameworks provide, and which decisions have proven good or bad in practice.


# Lossy Transforms

Automatic normalisation and stripping of BOMs seems useful, but it immediately rules out use of this class for anything where you want to get back what you put in. For instance, if an ORM used Text instances for strings in data models, it would generate extra Update queries on the database even when the string wasn't otherwise changed. I think it would be better to make this easy but explicit.
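
To make the round-trip problem concrete, here is a small sketch using the existing Normalizer class from ext/intl. It assumes the stored value happens to be in decomposed form; a dirty-check that compares bytes would then see the normalised value as "changed" even though nobody edited it.

```php
<?php
// Requires ext/intl.
// "e" followed by U+0301 COMBINING ACUTE ACCENT: the decomposed form of "é".
$stored = "e\u{0301}";

// NFC composes the pair into the single code point U+00E9.
$normalised = Normalizer::normalize($stored, Normalizer::FORM_C);

var_dump($stored === $normalised);               // bool(false)
var_dump(strlen($stored), strlen($normalised));  // int(3) and int(2)

// An explicit alternative: only normalise when the caller asks for it.
$needsWork = !Normalizer::isNormalized($stored, Normalizer::FORM_C);
var_dump($needsWork);                            // bool(true)
```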


# UTF-8 on the outside, UTF-16 on the inside

I know this will be a very common combination, but it feels odd that an application which actually wanted to work with UTF-16 would need to perform round-trips through UTF-8 just to use this class. It should at least be possible to specify the encoding on input and output.

Ruby takes an interesting approach where strings are tagged with their current binary encoding, and only converted to another form if actually required. If your input layer says "$name = new Text($_GET['name'], 'Windows-1252');" and your output layer says "echo $name->asBytes('Windows-1252');" the overhead of converting to UTF-16 can be skipped entirely, unless something in between says "$name = $name->wordsToUpper()". This also removes another source of lossy transformation, since some encoding conversions aren't perfectly reversible (e.g. the source encoding has more than one byte sequence mapped to the same Unicode code point).
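
A minimal sketch of that tagging idea, using mbstring for the conversions. TaggedText and its $conversions counter are hypothetical names for illustration only, and MB_CASE_TITLE is just a stand-in for whatever wordsToUpper() would really do; UTF-8 is used here as the working form purely to keep the example short.

```php
<?php
// Requires ext/mbstring and PHP 8.0+. TaggedText is a hypothetical name.
final class TaggedText
{
    // Counts real conversions, only so the laziness is observable.
    public int $conversions = 0;

    public function __construct(
        private string $bytes,
        private string $encoding = 'UTF-8',
    ) {}

    public function asBytes(string $encoding): string
    {
        // Same encoding in and out: hand back the original bytes untouched.
        if (strcasecmp($encoding, $this->encoding) === 0) {
            return $this->bytes;
        }
        $this->conversions++;
        return mb_convert_encoding($this->bytes, $encoding, $this->encoding);
    }

    public function wordsToUpper(): self
    {
        // Only operations that need to understand the text force a conversion.
        $utf8 = $this->asBytes('UTF-8');
        return new self(mb_convert_case($utf8, MB_CASE_TITLE, 'UTF-8'), 'UTF-8');
    }
}

// 0xE9 is "é" in Windows-1252.
$name = new TaggedText("jos\xE9", 'Windows-1252');
$out = $name->asBytes('Windows-1252'); // no conversion performed
var_dump($out === "jos\xE9", $name->conversions);
```

Because the original bytes are returned as-is when the encodings match, this path is also free of any conversion loss.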


# Internationalisation

Having locale and collation as state on the object, rather than parameters on relevant methods, feels like muddling responsibilities. It makes it hard to reason about what exactly some of the methods will do: Can I trust that this object will give me a sensible result from compareWith, or has it been assigned a collation somewhere else? What exactly will be the definition of "replace" or "contains" for this pair of objects?

How users will work with these also needs careful thought - your first listed design goal is "keep it simple", but under locales and Internationalisation is the worrying sentence "This will require extensive documentation". This is one of those places where "doing it right" is really hard to combine with "making it easy", because language is inherently complex, but users will expect a simple answer to "how do I make it case-insensitive?"
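
For comparison, here is a sketch of what passing the locale at the call site could look like, built on the existing Collator class from ext/intl. The helper compareText() and its parameters are hypothetical, not something the RFC proposes. It also shows that "case-insensitive" in ICU terms is a collation strength rather than a simple flag.

```php
<?php
// Requires ext/intl. compareText() is a hypothetical helper.
function compareText(string $a, string $b, string $locale, bool $ignoreCase = false): int
{
    $collator = new Collator($locale);
    if ($ignoreCase) {
        // Case is a tertiary difference; SECONDARY strength ignores it
        // while still distinguishing accents.
        $collator->setStrength(Collator::SECONDARY);
    }
    return $collator->compare($a, $b);
}

// The same pair of strings orders differently depending on the locale:
// German sorts "ö" with "o", Swedish sorts it after "z".
var_dump(compareText("\u{f6}", "z", 'de_DE') < 0);
var_dump(compareText("\u{f6}", "z", 'sv_SE') > 0);

// Case-insensitive comparison, with the locale still stated explicitly.
var_dump(compareText("Hello", "hello", 'en_GB', true) === 0);
```

With the locale visible at each call, the result of a comparison no longer depends on what some other part of the code did to the object earlier.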


# Allowing other abstractions

I 100% approve of your use of grapheme clusters, rather than code points, as the primary unit; so many implementations get that wrong. However, when interacting with other systems, reasoning about bytes (or sometimes even codepoints) is essential.

One function that I would really like to see, for instance, is a grapheme-aware version of mb_strcut, to solve tasks like: "encode this abstract Unicode string as UTF-16BE, truncated to at most 200 bytes, without breaking apart any grapheme clusters".
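
A rough sketch of how that could be done today with ext/intl and mbstring, assuming UTF-8 input. The function name graphemeCut() is hypothetical; it walks grapheme cluster boundaries with IntlBreakIterator and stops before the first cluster that would exceed the byte budget in the target encoding.

```php
<?php
// Requires ext/intl and ext/mbstring. graphemeCut() is a hypothetical name.
function graphemeCut(string $utf8, int $maxBytes, string $encoding = 'UTF-16BE'): string
{
    $iterator = IntlBreakIterator::createCharacterInstance();
    $iterator->setText($utf8);

    $out = '';
    // Each part is one grapheme cluster, still encoded as UTF-8.
    foreach ($iterator->getPartsIterator() as $cluster) {
        $encoded = mb_convert_encoding($cluster, $encoding, 'UTF-8');
        if (strlen($out) + strlen($encoded) > $maxBytes) {
            break; // adding this cluster would break the byte limit
        }
        $out .= $encoded;
    }
    return $out;
}

// "a", then "e" + U+0301 (one cluster, 4 bytes in UTF-16BE), then "b".
$input = "ae\u{0301}b";
echo bin2hex(graphemeCut($input, 4)), "\n"; // 0061
echo bin2hex(graphemeCut($input, 6)), "\n"; // 006100650301
```

With a limit of 4 bytes the accented cluster is dropped whole rather than keeping a bare "e", which is the behaviour a byte-oriented cut cannot guarantee.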

Thanks again for getting the ball rolling, and I look forward to helping iterate the design.


Regards,

--
Rowan Tommins
[IMSoP]
