On 15/12/2022 15:34, Derick Rethans wrote:
> I have just published an initial draft of the "Unicode Text Processing"
> RFC, a proposal to have performant unicode text processing always
> available to PHP users, by introducing a new "Text" class.
> You can find it at:
> https://wiki.php.net/rfc/unicode_text_processing
> I'm looking forwards to hearing your opinions, additions, and
> suggestions - the RFC specifically asks for these in places.
As others have said already, thank you for taking a stab at this
important topic. I agree that it would be a really useful feature for
the language, but it's also a really difficult one to get right. Here
are my initial thoughts...
# Design Process
Rather than designing the whole class "on paper", I think this really
needs to be built as a prototype, where we can build up documentation
and tests, plug variations into some real life scenarios, and have
separate discussions about different details. If we limit ourselves
initially to features already exposed by ext/intl (which I think covers
everything proposed so far), a prototype doesn't even need to be an
extension; it can be written in pure PHP. Then once the design is
finalised, you have a ready-made polyfill for older PHP versions, and a
set of tests for the native version :)
We might also want to do some general investigation of what other
languages and frameworks provide, and which decisions have proven good
or bad in practice.
# Lossy Transforms
Automatic normalisation and stripping of BOMs seem useful, but they
immediately rule out use of this class for anything where you want to
get back what you put in. For instance, if an ORM used Text instances
for strings in data models, it would generate extra Update queries on
the database even when the string wasn't otherwise changed. I think it
would be better to make this easy but explicit.
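To illustrate what I mean by "easy but explicit", here is a minimal sketch. The class name RawText and its method names are made up for this example and are not part of the RFC; normalised() assumes ext/intl's Normalizer class is available.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: a Text-like value object that keeps its input
 * byte-for-byte and only transforms it when asked. Not the RFC's API.
 */
final class RawText
{
    private const UTF8_BOM = "\xEF\xBB\xBF";

    public function __construct(private string $utf8) {}

    /** Returns a copy with a leading UTF-8 BOM removed, if there is one. */
    public function withoutBom(): self
    {
        return str_starts_with($this->utf8, self::UTF8_BOM)
            ? new self(substr($this->utf8, strlen(self::UTF8_BOM)))
            : $this;
    }

    /** Returns a copy normalised to NFC (assumes ext/intl is loaded). */
    public function normalised(): self
    {
        $nfc = Normalizer::normalize($this->utf8, Normalizer::FORM_C);
        if ($nfc === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return new self($nfc);
    }

    /** Returns exactly the bytes this instance holds. */
    public function asBytes(): string
    {
        return $this->utf8;
    }
}
```

With something like this, an ORM could compare asBytes() against the stored value and see no change, while application code that wants the clean-up writes "$text->withoutBom()->normalised()" once at the input boundary.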
# UTF-8 on the outside, UTF-16 on the inside
I know this will be a very common combination, but it feels odd that an
application which actually wanted to work with UTF-16 would need to
perform round-trips through UTF-8 just to use this class. It should at
least be possible to specify the encoding on input and output.
Ruby takes an interesting approach where strings are tagged with their
current binary encoding, and only converted to another form if actually
required. If your input layer says "$name = new Text($_GET['name'],
'Windows-1252');" and your output layer says "echo
$name->asBytes('Windows-1252');" the overhead of converting to UTF-16
can be skipped entirely, unless something in between says "$name =
$name->wordsToUpper()". This also removes another source of lossy
transformation, since some encoding conversions aren't perfectly
reversible (e.g. the source encoding has more than one byte sequence
mapped to the same Unicode code point).
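A rough sketch of that tagging idea in userland PHP, to make it concrete. TaggedText and its methods are hypothetical names for this example only, and I'm using mbstring for the conversions purely for convenience; a real implementation would presumably hold UTF-16 internally once converted.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of an encoding-tagged string: bytes are stored as
 * received and converted only when a different encoding is requested.
 * Names are illustrative, not the RFC's API.
 */
final class TaggedText
{
    public function __construct(
        private string $bytes,
        private string $encoding = 'UTF-8',
    ) {}

    /** Returns the bytes in $encoding, converting only if needed. */
    public function asBytes(string $encoding = 'UTF-8'): string
    {
        if (strcasecmp($encoding, $this->encoding) === 0) {
            return $this->bytes; // same encoding: no conversion, no loss
        }
        $converted = mb_convert_encoding($this->bytes, $encoding, $this->encoding);
        if ($converted === false) {
            throw new RuntimeException('Encoding conversion failed');
        }
        return $converted;
    }

    /** An operation needing Unicode semantics forces a conversion. */
    public function toUpper(): self
    {
        $utf8 = $this->asBytes('UTF-8');
        return new self(mb_strtoupper($utf8, 'UTF-8'), 'UTF-8');
    }
}
```

The point of the design is that the pass-through path never touches the bytes, so even input that isn't valid in its declared encoding comes back out unchanged unless something actually needs to interpret it.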
# Internationalisation
Having locale and collation as state on the object, rather than
parameters on relevant methods, feels like muddling responsibilities. It
makes it hard to reason about what exactly some of the methods will do:
Can I trust that this object will give me a sensible result from
compareWith, or has it been assigned a collation somewhere else? What
exactly will be the definition of "replace" or "contains" for this pair
of objects?
How users will work with these also needs careful thought - your first
listed design goal is "keep it simple", but under locales and
Internationalisation is the worrying sentence "This will require
extensive documentation". This is one of those places where "doing it
right" is really hard to combine with "making it easy", because language
is inherently complex, but users will expect a simple answer to "how do
I make it case-insensitive?"
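As a sketch of the alternative, the comparison rules can be passed in at the call site, so the reader always sees which rules apply. compareText() and caseInsensitiveCollator() are hypothetical helpers I've invented for this example; Collator and its strength constants are the existing ext/intl API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: the collation is an argument, not object state,
 * so the rules in force are visible wherever the comparison happens.
 */
function compareText(string $a, string $b, Collator $collator): int
{
    $result = $collator->compare($a, $b);
    if ($result === false) {
        throw new RuntimeException('Collation failed');
    }
    return $result;
}

/**
 * One possible answer to "how do I make it case-insensitive?":
 * secondary strength distinguishes base letters and accents, not case.
 */
function caseInsensitiveCollator(string $locale): Collator
{
    $collator = new Collator($locale);
    $collator->setStrength(Collator::SECONDARY);
    return $collator;
}
```

Here "compareText('apple', 'APPLE', caseInsensitiveCollator('en'))" should treat the two strings as equal, while the same call with a default "new Collator('en')" should not. Even this small example shows the documentation problem: "case-insensitive" still leaves open whether accents should matter.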
# Allowing other abstractions
I 100% approve of your use of grapheme clusters, rather than code
points, as the primary unit; so many implementations get that wrong.
However, when interacting with other systems, reasoning about bytes (or
sometimes even code points) is essential.
One function that I would really like to see, for instance, is a
grapheme-aware version of mb_strcut, to solve tasks like: "encode this
abstract Unicode string as UTF-16BE, truncated to at most 200 bytes,
without breaking apart any grapheme clusters".
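Something like the following is what I have in mind, as a rough userland sketch. graphemeCut() is a hypothetical name; I'm using PCRE's \X as a stand-in for proper grapheme segmentation (ext/intl's IntlBreakIterator would be another option) and mbstring for the encoding step.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: encode $utf8 as $encoding, keeping as many whole
 * grapheme clusters as fit within $maxBytes.
 * Assumes a stateless target encoding, so that separately encoded
 * clusters can simply be concatenated (true for UTF-16BE).
 */
function graphemeCut(string $utf8, int $maxBytes, string $encoding = 'UTF-16BE'): string
{
    // \X matches one extended grapheme cluster; the /u flag makes
    // preg_match_all() return false on invalid UTF-8 input.
    if (preg_match_all('/\X/u', $utf8, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $out = '';
    foreach ($matches[0] as $cluster) {
        $encoded = mb_convert_encoding($cluster, $encoding, 'UTF-8');
        if (strlen($out) + strlen($encoded) > $maxBytes) {
            break; // the next cluster would not fit whole
        }
        $out .= $encoded;
    }
    return $out;
}
```

For the task above, the call would be "graphemeCut($string, 200)". Unlike a cut at a code point boundary, a base letter followed by a combining accent is either kept whole or dropped whole.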
Thanks again for getting the ball rolling, and I look forward to helping
iterate the design.
Regards,
--
Rowan Tommins
[IMSoP]