Hey Joe,

I think there are a few issues with the proposal, although I like the 
general idea. I've had the tab with the RFC open since October... but 
never looked at it until now :-/. So, a few comments:

- UString as a name.

I would prefer "Text" as the class name. Unicode (and intl/ICU) have 
lots of operations acting on items that contain Unicode strings, but 
those items are really pieces of text: for example sentences, word 
break iterators, etc. UString *feels* clunky, and not "standard". If 
it's going to be part of PHP core, then we should pick a "core" name. (I 
might prefer String, but that's going to cause a whole lot of issues 
obviously).

- "Needs More Methods"

I had a look at the API that it links to, and I miss operations such as 
iterators over words, sentences, characters, etc. Basically the 
functionality of 
http://docs.php.net/manual/en/class.intlbreakiterator.php, 
http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and 
http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php

I realize intl already implements this, but it's really beneficial to 
have for a "Text" class - especially for replacing code where people 
now loop over a string with a character index. 
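
As a rough sketch of what I mean, this is how iterating over words and 
sentences looks today with intl's IntlBreakIterator. The helper name 
textParts() and the 'en_US' locale are made up for illustration and are 
not part of the RFC; a "Text" class would presumably expose something 
along these lines directly:

```php
<?php
// Hypothetical helper (not in the RFC): split $text into parts with an
// ICU break iterator instead of indexing into the string.
function textParts(IntlBreakIterator $it, string $text): array
{
    $it->setText($text);
    // getPartsIterator() yields the substrings between boundaries.
    return iterator_to_array($it->getPartsIterator(), false);
}

$text = "Hello world. How are you?";

$sentences = textParts(IntlBreakIterator::createSentenceInstance('en_US'), $text);
$words     = textParts(IntlBreakIterator::createWordInstance('en_US'), $text);

// Word parts also include whitespace and punctuation; keep only the
// parts that contain at least one letter or digit.
$words = array_values(array_filter($words, function ($part) {
    return preg_match('/[\p{L}\p{N}]/u', $part) === 1;
}));
```

Nothing here needs a character index, which is the point: the caller 
asks for words or sentences and gets whole units back.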

- "Not a full String API Replacement"

I would certainly expect more from it than just the UnicodeString API. 
Perhaps not for a first iteration, but certainly for subsequent 
versions. Things like transliterations, and specifically iterators would 
be high on my list.

- "Patch"

There are toUpper and toLower, but the matching toTitle is missing.

- In the code's README:

"Note: UString is interchangable with zend strings for method parameters 
and can be cast for output/conversion to zend strings"

How does that work? And what would it convert to?

- How are "characters" counted?

Is a character a code point, or is a character a base character + 
combining diacritics? In the first form, A + ° is considered as two 
characters; in the second, just one. For wordwrap, splice, 
substring, it is really important that only the *full sequence* is 
considered as a character. And hence, a character really should be the 
full sequence. The text in "charAt" seems to contradict that, and that 
is a mistake.

In the original PHP 6 we didn't do that due to performance reasons, but 
that point is moot now as only people who opt into using "Text" will 
suffer from this.
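
To show the difference, here is a small sketch that counts the same 
string both ways using only PCRE. I am assuming the "°" above stands for 
U+030A COMBINING RING ABOVE; the sample string and variable names are 
mine, not from the RFC or the patch:

```php
<?php
// "A" followed by U+030A COMBINING RING ABOVE: two code points, but one
// user-perceived character (assumed example, not taken from the RFC).
$s = "A\u{030A}ngstrom";

// With the /u modifier, "." matches a single code point
// (the /s modifier lets it match newlines as well).
$codePoints = preg_match_all('/./su', $s);

// \X matches an extended grapheme cluster: a base character together
// with any combining marks that follow it.
$clusters = preg_match_all('/\X/u', $s, $m);

// Taking the first "character" by cluster keeps the sequence intact.
$first = $m[0][0];
```

With the code point view, a substring of length one would return a bare 
"A" and leave the ring attached to whatever follows; with the cluster 
view the full sequence stays together.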

- "trim"

What is a leading or trailing space? Is it just U+0020, or other Unicode 
defined space characters as well? (U+00A0, the no-break space, comes to 
mind here)
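
For comparison, a minimal sketch of a trim that uses Unicode properties 
rather than a fixed byte list. unicodeTrim() is a hypothetical helper 
for illustration only; I am not claiming this is what the patch does or 
should do:

```php
<?php
// Hypothetical helper: strip leading and trailing Unicode separators.
function unicodeTrim(string $s): string
{
    // \p{Z} covers the separator categories (Zs, Zl, Zp), which include
    // U+0020, U+00A0 and U+2003; \s adds tab, newline and friends.
    // preg_replace() returns null on invalid UTF-8, so fall back to $s.
    return preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $s) ?? $s;
}

$s = "\u{00A0} hello\u{00A0}world \u{2003}";

$unicodeTrimmed = unicodeTrim($s);
// The built-in trim() works on bytes and does not know U+00A0 or U+2003.
$byteTrimmed = trim($s);
```

Note that the no-break space in the middle of the string is kept; only 
the leading and trailing runs are removed.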

- What is "UG(defaultpad)," about?

- For the code:

  - there is some interesting, non-standard whitespacing going on:

    - { goes on the next line after a func decl
    - sometimes 4 spaces instead of a tab are used for indentation

- Why is there no __toString() ?

- How can other extensions, not really making use of "Text", use these 
  strings (as UTF-8 strings, for example)?


cheers,
Derick


On Sat, 28 Feb 2015, Joe Watkins wrote:

> Morning internals,
> 
>     This is just a quick note to announce my intention to ready this RFC
> for voting next week.
> 
>     I know I'm a little late maybe, I was real sick most of last week, so
> couldn't do anything useful.
> 
>     A couple of us intend to fix outstanding issues on github and those
> raised here, tidy the RFC and open the vote for 7.
> 
>    I would ask anyone interested to scan through this thread and announce
> concerns that are not mentioned asap.
> 
> Cheers
> Joe
> 
> On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright <daveran...@php.net> wrote:
> 
> > On 24 October 2014 07:03, Joe Watkins <pthre...@pthreads.org> wrote:
> >
> >> On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:
> >> > Hi!
> >> >
> >> > > P.S. u() is a bad name, will break lots of code, i.e.
> >> >
> >> > Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's
> >> safe.
> >> >
> >>
> >> /me cringes ...
> >>
> >> I wonder how much of a problem it really is, usually when we say some
> >> function name is a problem is because of hundreds and hundreds of
> >> results on github.
> >>
> >> If it's a huge problem then we should rename it, if we have to dig
> >> around for a single project that's incompatible, or even a handful, then
> >> it's not really a problem.
> >>
> >> Cheers
> >> Joe
> >
> >
> > I can see this being something relatively common. While I personally would
> > never do it, there are a few reasons I can think of that people *might* do
> > it:
> >
> > - Wrapper for creating <u> HTML output
> > - urlencode() shortcut
> > - (obviously) various unicode-related things
> >
> > Searching on codesearch [1] revealed (amongst a few other hits on the
> > first page) another interesting use of it in the hhvm test suite [2]. It's
> > difficult to search for this because all the available public search
> > engines that I know of do fuzzy matching.
> >
> > Sorry. This sucks, because every other option we have for this is sucks.
> >
> > On the bright side, anything chosen could always be aliased at the top of
> > the file:
> >
> > use function __u as u;
> >
> > This also sucks, but it sucks a little bit less because the collisions are
> > avoided - or at least, avoided in such a way that the onus is on the user -
> > and one can still have the sane name.
> >
> > First-class support at the syntax level (presumably $foo = u"unicode
> > string" since we already have $foo = b"binary string") would IMO be better
> > and (hopefully?) a long-term goal, but I am aware that it is - and probably
> > should be - outside the scope of the current proposal.
> >
> > [1] https://searchcode.com/?q=function+u+lang%3Aphp
> > [2]
> > https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13
> >
> 

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine
-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php