Hi Joe and Derick, On Mon, Mar 2, 2015 at 5:25 AM, Derick Rethans <der...@php.net> wrote:
> I think there are a few issues with the proposal, although I like the > general idea. I've had the tab with the RFC open since October... but > never looked at it until now :-/. So, a few comments: > > - UString as a name. > > I think I am going to prefer "Text" as a class name. Unicode (and > intl/icu) have lots of operators acting on items containing unicode > strings. But they are really pieces of text. For example sentences, word > break iterators, etc. UString *feels* clunky, and not "standard". If > it's going to be part of PHP core, then we should pick a "core" name. (I > might prefer String, but that's going to cause a whole lot of issues > obviously). > I think it's better to have "string/text" data as certain encoding/codepage. Although Unicode encoding conversion is cheap, (I mean cheap compare to conversion to other encodings, like SJIS, EUC, ISO-2022, etc), UTF-8 is better because - PCRE only supports UTF-8 - SQLite only supports UTF-8 - PHP uses UTF-8 as the default now - Recent web apps uses UTF-8 as encoding - Single encoding for stored text/string is simpler - Considering normalization, having UTF-8 with NFC is less confusing. However, I don't mind too much allowing any encoding stored in "Text"/ "UString" object. IIRC, Ruby does this and have not much problem. If we have multiple encoding support. We should resolve $new = $str_utf8 . $str_sjis; // $new is UTF-8 or SJIS? Raise error? $new = $str_nfc . $str_nfd; // $new is NFC or NFD, mixed? Raise error? $new = $str_utf16le . $str_utf16be; // $new is ?? How BOM is handled? > - "Needs More Methods" > > I had a look at the API that that links to, and I miss operators like > iterators. Over words, sentences, characters, etc. Basically the > functionality of > http://docs.php.net/manual/en/class.intlbreakiterator.php, > http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and > http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php > > I realize intl already immplements, this, but it's really beneficial to > have for a "Text" class - especially for replacing functionality where > people now look over a string - with a character index. > > There are missing features... We may implement most of them before release. > - "Not a full String API Replacement" > > I would certainly expect more from it than just the UnicodeString API. > Perhaps not for a first iteration, but certainly for subsequent > versions. Things like transliterations, and specifically iterators would > be high on my list. > Sounds good. > > - "Patch" > > toUpper/toLower, there is a missing one for toTitle > > - In the code's README: > > "Note: UString is interchangable with zend strings for method parameters > and can be cast for output/conversion to zend strings" > > How does that work? And what would it convert to? > I guess Joe means it's using zend_string internally? > > - How are "characters" counted? > > Is a character a Code Point, or is a character a base character + > combining diacritics. In the first form, A + ° is considered as > characters, in the second option, just one. For wordwrap, splice, > substring, it is really important that only the *full sequence* is > considered as a character. And hence, a character really should be the > full sequence. The text in "charAt" seems to contradict that, and that > is a mistake. > One reason I prefer NFC. > > In the original PHP 6 we didn't do that due to perormance reasons, but > that point is moot now as only people who opt into using "Text" will > suffer from this. > > - "trim" > > What is a leading or trailing space? Is it just U+0020, or other Unicode > defined space characters as well? ( , U+00A0 comes to mind here) > Any "space" is better to be trimmed. > > - What is "UG(defaultpad)," about? > > - For the code: > > - there is some interesting, non standard whitespaceing going on: > > - { goes on next line after a func decl > - sometimes 4 spaces in stead of a tab are used for indentation, > > - Why is there no __toString() ? > If this is missing, there should be __toString() > > - How can other extensions, not really making use of "Text", use there > strings (as UTF8 strings f.e.) > I agree that Internal API needs improvement. Overall, I think it's good for starting if basic issue is resolved. The most important is "if it supports single or multiple encoding for stored text/string?". There are many things programmers should know if multiple encoding is supported, but I don't object strongly to have multiple encoding support. It's nice to have ability to handle SJIS, ISO-2022, etc natively. Regards, -- Yasuo Ohgaki yohg...@ohgaki.net