On Thu, Dec 15, 2022, at 9:34 AM, Derick Rethans wrote: > Hi, > > I have just published an initial draft of the "Unicode Text Processing" > RFC, a proposal to have performant unicode text processing always > available to PHP users, by introducing a new "Text" class. > > You can find it at: > https://wiki.php.net/rfc/unicode_text_processing > > I'm looking forwards to hearing your opinions, additions, and > suggestions — the RFC specifically asks for these in places. > > cheers, > Derick > > > -- > https://derickrethans.nl | https://xdebug.org | https://dram.io > > Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support > Host of PHP Internals News: https://phpinternals.news > > mastodon: @derickr@phpc.social @xdebug@phpc.social > twitter: @derickr and @xdebug
Derick, thank you for tackling this. It's a decidedly not-simple problem space and I'm glad someone like you is looking into it. I'm overall in favor of the RFC, though I have some comments/pushback. First, here's my notes just as I'm reading through it: Re Text::create, which the RFC suggests can be aliased as a function for easier use: Are you sure? Symfony has stand-alone wrapper functions. You cannot alias a static method directly to a function. cf: https://3v4l.org/V4kP2 (It would be really nice if that worked! But it doesn't seem to.) Text::concat: What happens if different Text objects passed to that have different collations? Which gets used, the first, last, "best fit"...? Text->wrap: " If $cutLongWords is set, no Text element will be larger than $maxWidth." Can you include an example here? I have to mentally noodle through what this means, which means it could use an example. Text->getPositionOfFirstOccurrence() et al: These should not return false on not-found. That is an anti-pattern. That PHP's existing libraries do that is a bug, not a feature. It has caused no end of bugs. null is the correct thing to return here, especially with the new null-handling syntax we have now in PHP 8. Do not use "false" as a not-found return, ever. Another option to consider is if some of them should return an empty Text object. (That may not be the best answer, but it's one worth considering.) Also, all of those names are very long. :-( Text->returnFromFirstOccurence(): I much prefer startingWith(). It's half the length and just as if not more descriptive. It also implies, to me, that the $search will be included in the result, whereas startingAt(), for whatever reason, doesn't seem like it does. Text->contains(): The header is missing the defined return type. Comparing Text Objects: Oh, for being able to overload those operators. This would be a great use case for it. :-( The examples in case-conversion are hard to follow, because the font of code samples is not that different from normal text. Could you perhaps multi-line them, to make it clearer where the "in" text ends and the "out" text starts? How do toTitle() and wordsToUpper() differ? They sound like the same thing... (Please note the difference in the RFC.) Why two methods for length? And why confuse it with "character" when the text has been very consistent about using grapheme to this point. getCodePointCount(): I... don't understand how this is different from length, so I don't see why we'd use it. If it's kept, please include a better explanation of what it is or why I'd care. getWordCount(): The example uses getWordIterator as a property, when I think it's supposed to be a method. Also, it's not syntax highlighted. "The return of the iterators are effected by the text's locale." - affected, not effected. getCharacterIterator(): Again, dropping in the word character here. Calling it getGraphemeIterator() would be terrible, of course. :-) This feels like an older part of the text that wasn't updated when most of it started using grapheme. Perhaps skip the explanation here and move a central definition of "character" to the start of the RFC? (Which could be "means the same as grapheme in this case, NOT the same as byte.") getWordIterator(): It's not clear to me if this includes whitespace as its own Text objects. Would the string "Mr. Smith Goes to Washington" be a word iterator of ["Mr.", "Smith", "Goes", "to", "Washington"]? Or ["Mr", ".", "Smith", "Goes", "to", "Washington"]? Or ["Mr", ".", " ", "Smith", " ", "Goes", " ", "to", " ", "Washington"]? I'm not clear which is the intent here. (Feel free to steal this example for the RFC.) getLineIterator(): I do not understand this description at all. From the name, I'd expect it to break the string at newline characters. The description seems like it's something completely different I do not understand. getTitleIterator(): What's a title, in this context? Transliteration section: The formatting here seems wonky and confusing. Please clean up. ----- Second, there is a PHP-FIG Working Group on translation. It's mostly idle at the moment, as we're waiting on the MessageFormat working group at the W3C to stabilize their next version so we can just steal it. I don't know that there's any direct overlap between this RFC and that WG, but I'm mentioning it for transparency, and to encourage people to think about how they could both be developed to play nicely together, whatever that means. Third, is there some way to say "this string, but in some other collation?" It looks like the only way to do that is via Text::create($txt, 'new-collation') / new Text($txt, 'new-collation'). A ->withCollation('new-collation') method would be very helpful, especially as so many methods rely on the collation for things like case insensitivity. That way, we could do $txt->withCollation('case-insensitive-english')->split(',') (or similar). Fourth, that brings me to my biggest concern. "The format of this locale/collation name needs extensive documentation." - This line scares the ever-loving crap out of me. :-) We know from experience that complex formatting strings are trivially screwed up, mistyped, or otherwise gotten wrong. Especially when they're not self-evident. ("ks" means "case-insensitive"? I would never have guessed that in a million years.) The links provided to the Unicode sites don't really illuminate anything for me. This to me sounds like it cries out for either a builder object, an enumeration (or multiple), or some combination of those. It sounds like the TextCoallator class is maybe trying to be that, but it's still under-described, especially for anyone who hasn't already used intl. It also looks like it's just producing a string, rather than, what I'd consider probably more ergonomic, using the object directly like DateTimeZone. I'm not sure what the best answer here is since I don't know the problem space well enough, but I do know that "here's a string, GL" is insufficient. We should noodle on how to make that more ergonomic so that casual developers don't get it horribly subtly wrong. (Because you know they/we/I will on this topic.) That may also include extra utility methods like ->withCaseInsensitiveCollation() (which changes only the case sensitivity marker but leaves the rest of the collation alone), or something like that. Again, I'm not sure what the best design here is. Everything i just said also applies to the "$transliterationString", which is mentioned in passing but no description is provided. I have no clue what the syntax for that even is. And of course it would be off brand for me to not note that issues with a fixed set of methods on an object like this, rather than pipe-friendly functions that are innately more extensible. I know, I know, we don't have pipes yet, but I have to mention it anyway or people would be worried about me. :-) --Larry Garfield -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php