Re: [PHP-DEV] [RFC] Unicode Text Processing

Larry Garfield Wed, 21 Dec 2022 18:27:59 -0800

On Thu, Dec 15, 2022, at 9:34 AM, Derick Rethans wrote:
> Hi,
>
> I have just published an initial draft of the "Unicode Text Processing" 
> RFC, a proposal to have performant unicode text processing always 
> available to PHP users, by introducing a new "Text" class.
>
> You can find it at:
> https://wiki.php.net/rfc/unicode_text_processing
>
> I'm looking forwards to hearing your opinions, additions, and 
> suggestions — the RFC specifically asks for these in places.
>
> cheers,
> Derick
>
>
> -- 
> https://derickrethans.nl | https://xdebug.org | https://dram.io
>
> Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
> Host of PHP Internals News: https://phpinternals.news
>
> mastodon: @derickr@phpc.social @xdebug@phpc.social
> twitter: @derickr and @xdebug



Derick, thank you for tackling this.  It's a decidedly not-simple problem space 
and I'm glad someone like you is looking into it.

I'm overall in favor of the RFC, though I have some comments/pushback.  First, 
here's my notes just as I'm reading through it:

Re Text::create, which the RFC suggests can be aliased as a function for easier 
use: Are you sure?  Symfony has stand-alone wrapper functions.  You cannot 
alias a static method directly to a function.  cf: https://3v4l.org/V4kP2  (It 
would be really nice if that worked!  But it doesn't seem to.)

Text::concat: What happens if different Text objects passed to that have 
different collations?  Which gets used, the first, last, "best fit"...?

Text->wrap: " If $cutLongWords is set, no Text element will be larger than 
$maxWidth."  Can you include an example here?  I have to mentally noodle 
through what this means, which means it could use an example.

Text->getPositionOfFirstOccurrence() et al: These should not return false on 
not-found.  That is an anti-pattern.  That PHP's existing libraries do that is 
a bug, not a feature.  It has caused no end of bugs.  null is the correct thing 
to return here, especially with the new null-handling syntax we have now in PHP 
8.  Do not use "false" as a not-found return, ever.  Another option to consider 
is if some of them should return an empty Text object.  (That may not be the 
best answer, but it's one worth considering.)

Also, all of those names are very long. :-(

Text->returnFromFirstOccurence(): I much prefer startingWith().  It's half the 
length and just as if not more descriptive.  It also implies, to me, that the 
$search will be included in the result, whereas startingAt(), for whatever 
reason, doesn't seem like it does.

Text->contains(): The header is missing the defined return type.

Comparing Text Objects: Oh, for being able to overload those operators.  This 
would be a great use case for it. :-(

The examples in case-conversion are hard to follow, because the font of code 
samples is not that different from normal text.  Could you perhaps multi-line 
them, to make it clearer where the "in" text ends and the "out" text starts?

How do toTitle() and wordsToUpper() differ?  They sound like the same thing...  
(Please note the difference in the RFC.)

Why two methods for length?  And why confuse it with "character" when the text 
has been very consistent about using grapheme to this point.

getCodePointCount(): I... don't understand how this is different from length, 
so I don't see why we'd use it.  If it's kept, please include a better 
explanation of what it is or why I'd care.

getWordCount(): The example uses getWordIterator as a property, when I think 
it's supposed to be a method.  Also, it's not syntax highlighted.

"The return of the iterators are effected by the text's locale." - affected, 
not effected.

getCharacterIterator(): Again, dropping in the word character here.  Calling it 
getGraphemeIterator() would be terrible, of course. :-)  This feels like an 
older part of the text that wasn't updated when most of it started using 
grapheme.  Perhaps skip the explanation here and move a central definition of 
"character" to the start of the RFC?  (Which could be "means the same as 
grapheme in this case, NOT the same as byte.")

getWordIterator(): It's not clear to me if this includes whitespace as its own 
Text objects.  Would the string "Mr. Smith Goes to Washington" be a word 
iterator of ["Mr.", "Smith", "Goes", "to", "Washington"]?  Or ["Mr", ".", 
"Smith", "Goes", "to", "Washington"]?  Or ["Mr", ".", " ", "Smith", " ", 
"Goes", " ", "to", " ", "Washington"]?  I'm not clear which is the intent here. 
 (Feel free to steal this example for the RFC.)

getLineIterator(): I do not understand this description at all.  From the name, 
I'd expect it to break the string at newline characters.  The description seems 
like it's something completely different I do not understand.

getTitleIterator(): What's a title, in this context?

Transliteration section: The formatting here seems wonky and confusing.  Please 
clean up.

-----

Second, there is a PHP-FIG Working Group on translation.  It's mostly idle at 
the moment, as we're waiting on the MessageFormat working group at the W3C to 
stabilize their next version so we can just steal it.  I don't know that 
there's any direct overlap between this RFC and that WG, but I'm mentioning it 
for transparency, and to encourage people to think about how they could both be 
developed to play nicely together, whatever that means.

Third, is there some way to say "this string, but in some other collation?"  It 
looks like the only way to do that is via Text::create($txt, 'new-collation') / 
new Text($txt, 'new-collation').  A ->withCollation('new-collation') method 
would be very helpful, especially as so many methods rely on the collation for 
things like case insensitivity.  That way, we could do 
$txt->withCollation('case-insensitive-english')->split(',') (or similar).

Fourth, that brings me to my biggest concern.  "The format of this 
locale/collation name needs extensive documentation." - This line scares the 
ever-loving crap out of me. :-)  We know from experience that complex 
formatting strings are trivially screwed up, mistyped, or otherwise gotten 
wrong.  Especially when they're not self-evident.  ("ks" means 
"case-insensitive"?  I would never have guessed that in a million years.)  The 
links provided to the Unicode sites don't really illuminate anything for me.

This to me sounds like it cries out for either a builder object, an enumeration 
(or multiple), or some combination of those.  It sounds like the TextCoallator 
class is maybe trying to be that, but it's still under-described, especially 
for anyone who hasn't already used intl.  It also looks like it's just 
producing a string, rather than, what I'd consider probably more ergonomic, 
using the object directly like DateTimeZone.

I'm not sure what the best answer here is since I don't know the problem space 
well enough, but I do know that "here's a string, GL" is insufficient.  We 
should noodle on how to make that more ergonomic so that casual developers 
don't get it horribly subtly wrong.  (Because you know they/we/I will on this 
topic.)  That may also include extra utility methods like 
->withCaseInsensitiveCollation() (which changes only the case sensitivity 
marker but leaves the rest of the collation alone), or something like that.  
Again, I'm not sure what the best design here is.

Everything i just said also applies to the "$transliterationString", which is 
mentioned in passing but no description is provided.  I have no clue what the 
syntax for that even is.

And of course it would be off brand for me to not note that issues with a fixed 
set of methods on an object like this, rather than pipe-friendly functions that 
are innately more extensible.  I know, I know, we don't have pipes yet, but I 
have to mention it anyway or people would be worried about me. :-)

--Larry Garfield

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php

Re: [PHP-DEV] [RFC] Unicode Text Processing

Reply via email to