Re: [PHP-DEV] [RFC] UString

Yasuo Ohgaki Sun, 01 Mar 2015 13:00:46 -0800

Hi Joe and Derick,

On Mon, Mar 2, 2015 at 5:25 AM, Derick Rethans <der...@php.net> wrote:


> I think there are a few issues with the proposal, although I like the
> general idea. I've had the tab with the RFC open since October... but
> never looked at it until now :-/. So, a few comments:
>
> - UString as a name.
>
> I think I am going to prefer "Text" as a class name. Unicode (and
> intl/icu) have lots of operators acting on items containing unicode
> strings. But they are really pieces of text. For example sentences, word
> break iterators, etc. UString *feels* clunky, and not "standard". If
> it's going to be part of PHP core, then we should pick a "core" name. (I
> might prefer String, but that's going to cause a whole lot of issues
> obviously).
>

I think it's better to have "string/text" data as certain encoding/codepage.
Although Unicode encoding conversion is cheap, (I mean cheap compare
to conversion to other encodings, like SJIS, EUC, ISO-2022, etc), UTF-8
is better because

 - PCRE only supports UTF-8
 - SQLite only supports UTF-8
 - PHP uses UTF-8 as the default now
 - Recent web apps uses UTF-8 as encoding
 - Single encoding for stored text/string is simpler
 - Considering normalization, having UTF-8 with NFC is less confusing.

However, I don't mind too much allowing any encoding stored in "Text"/
"UString" object. IIRC, Ruby does this and have not much problem.

If we have multiple encoding support. We should resolve

$new = $str_utf8 . $str_sjis; // $new is UTF-8 or SJIS? Raise error?
$new = $str_nfc . $str_nfd; // $new is NFC or NFD, mixed? Raise error?
$new = $str_utf16le . $str_utf16be; // $new is ?? How BOM is handled?


> - "Needs More Methods"
>
> I had a look at the API that that links to, and I miss operators like
> iterators. Over words, sentences, characters, etc. Basically the
> functionality of
> http://docs.php.net/manual/en/class.intlbreakiterator.php,
> http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
> http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php
>
> I realize intl already immplements, this, but it's really beneficial to
> have for a "Text" class - especially for replacing functionality where
> people now look over a string - with a character index.
>
>
There are missing features... We may implement most of them before
release.


> - "Not a full String API Replacement"
>
> I would certainly expect more from it than just the UnicodeString API.
> Perhaps not for a first iteration, but certainly for subsequent
> versions. Things like transliterations, and specifically iterators would
> be high on my list.
>

Sounds good.


>
> - "Patch"
>
> toUpper/toLower, there is a missing one for toTitle
>
> - In the code's README:
>
> "Note: UString is interchangable with zend strings for method parameters
> and can be cast for output/conversion to zend strings"
>
> How does that work? And what would it convert to?
>

I guess Joe means it's using zend_string internally?


>
> - How are "characters" counted?
>
> Is a character a Code Point, or is a character a base character +
> combining diacritics. In the first form, A + ° is considered as
> characters, in the second option, just one. For wordwrap, splice,
> substring, it is really important that only the *full sequence* is
> considered as a character. And hence, a character really should be the
> full sequence. The text in "charAt" seems to contradict that, and that
> is a mistake.
>

One reason I prefer NFC.


>
> In the original PHP 6 we didn't do that due to perormance reasons, but
> that point is moot now as only people who opt into using "Text" will
> suffer from this.
>
> - "trim"
>
> What is a leading or trailing space? Is it just U+0020, or other Unicode
> defined space characters as well? (&nbsp;, U+00A0 comes to mind here)
>

Any "space" is better to be trimmed.


>
> - What is "UG(defaultpad)," about?
>
> - For the code:
>
>   - there is some interesting, non standard whitespaceing going on:
>
>     - { goes on next line after a func decl
>     - sometimes 4 spaces in stead of a tab are used for indentation,
>
> - Why is there no __toString() ?
>

If this is missing, there should be __toString()


>
> - How can other extensions, not really making use of "Text", use there
>   strings (as UTF8 strings f.e.)
>

I agree that Internal API needs improvement.

Overall, I think it's good for starting if basic issue is resolved.
The most important is "if it supports single or multiple encoding for
stored text/string?".
There are many things programmers should know if multiple encoding is
supported,
but I don't object strongly to have multiple encoding support. It's nice to
have ability
to handle SJIS, ISO-2022, etc natively.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net

Re: [PHP-DEV] [RFC] UString

Reply via email to