On Sat, 28 Feb 2015, Rowan Collins wrote:

> On 28/02/2015 06:48, Joe Watkins wrote:
> > Morning internals,
> >
> > This is just a quick note to announce my intention to ready this RFC
> > for voting next week.
> >
> > I know I'm a little late maybe, I was real sick most of last week, so
> > couldn't do anything useful.
> >
> > A couple of us intend to fix outstanding issues on github and those
> > raised here, tidy the RFC and open the vote for 7.
> >
> > I would ask anyone interested to scan through this thread and announce
> > concerns that are not mentioned asap.
>
> I still think this class is trying to do several jobs, and not doing any of
> them very well, and I fear that people will see this class and expect it to
> solve problems which it actually ignores.
>
> Here are some concrete use cases I would like a simple interface to solve for
> me:
>
> - Take text from an ISO 8859-2 data source, pass it through generic text
>   filters, and pass it to a UTF-16 data target.
> - Given a long string of Unicode text, give me a valid UTF-8 string which fits
>   into a buffer with fixed byte size; i.e. give me the largest number of whole
>   code points which fit into that number of bytes once encoded.
> - As above, but without stripping diacritics off the last character of the
>   resulting string, i.e. give me the largest number of whole graphemes which
>   fit.
> - Split a string into equal sized chunks of readable characters (graphemes),
>   regardless of how many bytes or code points each chunk contains.
>
> UString currently falls short of all of these:
>
> - I can specify my input encoding (in the constructor or helper method,
>   over-riding a static default, which is equivalent to ext/mbstring's global
>   setting), but not my output encoding (there is no method to ask for a byte
>   representation other than a string cast, which by definition has no
>   parameters).

Yeah, there should be an output method to convert to a target encoding.

> - I can ask for a fixed number of code points, but don't know how many bytes
>   these will take until I cast to a UTF-8 string.

As I said before, indexes into strings should not be done on code points,
as the following would then break the characters. With a decomposed Å
(U+0041 + U+030A):

    $s = new Text("Ås");
    echo $s->substring(1);

The output would be:

    ̊

Whereas with a precomposed Å (U+00C5):

    $s = new Text("Ås");
    echo $s->substring(1);

would output "s". That difference is not what people would expect.

> - I can't manipulate anything at the grapheme level at all, even though this
>   is the most meaningful level of operation in most cases.

Yes - graphemes should be the base blocks, not code points.

> Things it does do:
>
> - a handful of methods give meaningful international text support: toUpper(),
>   toLower(), trim()
> - some methods could be done on byte strings if I ensure they're all in UTF-8:
>   replace(), contains(), startsWith(), endsWith(), repeat()

That doesn't always work when you have graphemes, or text in different
normalisation forms. I.e., it should consider Å (U+00C5) and Å (U+0041 +
U+030A) the same for contains and startsWith; in other words, it should
handle normalisation for comparison.

> - there may be limited situations where I want to dive into the code points
>   which make up a string, although I can't think of many: $length, pad(),
>   indexOf(), lastIndexOf(), charAt(), replaceSlice()

Break iterators on either code points, or graphemes, might work here?

> - remaining methods avoid me creating invalid UTF-8, but don't help me
>   much with real-life text: chunk(), split(), substring()
> - I can ask what codepage my Unicode string is in; I don't even understand
>   what this means
>
> I think an efficient OO wrapper around ICU is a great idea, but more
> thought needs to go into what methods are exposed, and how people are
> going to use them in real code.

Yes - I agree.

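To make the code point versus grapheme point above concrete, here is a
minimal sketch using the existing ext/mbstring and ext/intl functions, not
the proposed class. It assumes both extensions are loaded and PHP 7+ for the
"\u{...}" escapes. startsWithNormalised() is a hypothetical helper invented
for this illustration; it is not part of UString or the RFC.

```php
<?php
// Sketch only: requires ext/mbstring and ext/intl; "\u{...}" needs PHP 7+.

// "Ås" with a decomposed Å: U+0041 (A) + U+030A (combining ring) + "s".
$decomposed  = "A\u{030A}s";
// "Ås" with a precomposed Å: U+00C5 + "s".
$precomposed = "\u{00C5}s";

// Code point indexing splits the base letter from its combining mark,
// leaving the ring at the front of the result.
$byCodePoint = mb_substr($decomposed, 1, null, 'UTF-8');

// Grapheme indexing treats "A" + ring as one user-perceived character.
$byGrapheme = grapheme_substr($decomposed, 1);

// Hypothetical helper (an assumption, not an existing API): normalise both
// sides to NFC, then compare the leading bytes.
function startsWithNormalised(string $haystack, string $needle): bool
{
    $h = Normalizer::normalize($haystack, Normalizer::FORM_C);
    $n = Normalizer::normalize($needle, Normalizer::FORM_C);
    return strncmp($h, $n, strlen($n)) === 0;
}

var_dump($byCodePoint === "\u{030A}s");                  // bool(true)
var_dump($byGrapheme === "s");                           // bool(true)
var_dump(strncmp($decomposed, "\u{00C5}", 2) === 0);     // bool(false)
var_dump(startsWithNormalised($decomposed, "\u{00C5}")); // bool(true)
```

Normalising before a byte comparison only covers the normalisation half of
the problem: a needle can still match part of a grapheme when no precomposed
form exists, so a real implementation would probably also want to check
grapheme boundaries.
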
I think this current proposal is a good start, but it needs to be worked out
a bit more before we should vote on it, however much I would like to see
something like this in PHP.

cheers,
Derick

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php