Re: [PHP-DEV] [RFC] UString

Rowan Collins Tue, 21 Oct 2014 13:44:34 -0700

On 21/10/2014 08:06, Joe Watkins wrote:

Morning internalz,


https://wiki.php.net/rfc/ustring

This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.

Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

Cheers
Joe

I think this looks like a really great start at creating something actually useful, rather than getting stuck at the drawing board. I like that the scope is quite small initially - where does the "single responsibility" of a class that represents a string end, anyway? :)


A few opinions:

1) Global / static defaults are bad.

The existence of the setDefaultCodepage method feels like an anti-pattern to me. It means libraries can't rely on this class working the same way in two different host environments, or even at two re-entries in the same program. Effectively, if you don't know what the second argument to the constructor will default to, you can't actually treat it as optional unless you're writing monolithic code. This is a common pattern in PHP, but http_build_query() would be so much more pleasant if I could safely call it with 1 argument instead of 3.
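
To make the http_build_query() complaint concrete, here is a small sketch. It assumes the arg_separator.output ini setting can be changed at runtime with ini_set(), standing in for two hosts with different configuration; the variable names are only for illustration.

```php
<?php
$params = ['a' => 1, 'b' => 2];

// Host environment A leaves the separator at "&".
ini_set('arg_separator.output', '&');
$inHostA = http_build_query($params);

// Host environment B prefers HTML-escaped separators.
ini_set('arg_separator.output', '&amp;');
$inHostB = http_build_query($params);

// Library code has to spell out all three arguments to be safe.
$explicit = http_build_query($params, '', '&');

var_dump($inHostA === $inHostB); // the same one-argument call differs
var_dump($explicit);
```

The one-argument call is only predictable if you control the whole environment, which is the same trap a settable default codepage would create.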

I think the default should be hard-coded to UTF-8, which according to previous discussion is always the default *output* encoding, so would mean this would always work: $aUString = new UString((string) $aUString); Any other encoding will be dependent on, and known from, the context where the object is created - if grabbing data from an HTTP request, a header should tell them; if from a database, a connection parameter; and so on.
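
A rough sketch of that round trip, using a hypothetical stand-in class (Utf8DefaultString is not the RFC's UString, and its internals are assumed) built on mb_convert_encoding() from the mbstring extension:

```php
<?php
// Hypothetical stand-in for UString; not the RFC's implementation.
final class Utf8DefaultString
{
    private $utf8;

    // The default is a hard-coded constant, never a mutable global.
    public function __construct($bytes, $encoding = 'UTF-8')
    {
        $this->utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }

    // Casting to string always yields UTF-8 bytes.
    public function __toString()
    {
        return $this->utf8;
    }
}

// Encoding known from context, e.g. a header saying ISO-8859-1.
$fromRequest = new Utf8DefaultString("caf\xE9", 'ISO-8859-1');

// Round trip through a plain string with no second argument.
$copy = new Utf8DefaultString((string) $fromRequest);

var_dump((string) $copy === (string) $fromRequest);
```

Because input default and output encoding match, the one-argument constructor is safe to call from library code.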

The only case I can see where a default encoding would be sensible would be where source code itself is in a different encoding, so that u('literal string') works as expected. I guess if we ever went down the route of special literal syntax like u'literal string', the declared source encoding could be used.

Actually, the u() shortcut function appears to be missing the encoding parameter completely; is this deliberate?


2) Clarify relationship to a "byte string"

Most of the API acts like this is an abstract object representing a bunch of Unicode code points. As such, I'm not sure what getCodepage() does - a code page (or more properly encoding) is a property of a stream of bytes, so has no meaning in this context, surely? The internal implementation could use UTF-8, UTF-16, or some made-up encoding (like Perl6's "NFG" system) and the user should never need to know (other than to understand performance implications).

On the other hand, when you *do* want a stream of bytes, the class doesn't seem to have an explicit way to get one. The (currently undocumented) behaviour is apparently to spit out UTF-8 if cast to a string, but it would be nice to have an explicit function which could be passed a parameter in order to serialise to, say, UTF-16, instead.
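
As an illustration of what such an explicit function might do, here is a minimal sketch. The helper name toByteString() is hypothetical, and it assumes the text is held as UTF-8 internally and that mbstring is available:

```php
<?php
// Hypothetical helper: serialise UTF-8 text to a caller-chosen encoding.
function toByteString($utf8, $encoding = 'UTF-8')
{
    return mb_convert_encoding($utf8, $encoding, 'UTF-8');
}

$text = "h\xC3\xA9"; // "hé" as UTF-8

$asUtf8 = toByteString($text);
$asUtf16 = toByteString($text, 'UTF-16BE');

echo bin2hex($asUtf8), "\n";  // 68c3a9
echo bin2hex($asUtf16), "\n"; // 006800e9
```

The sketch names UTF-16BE rather than plain UTF-16 so that byte order is stated explicitly instead of being left to a library default.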


3) The Grapheme Question

This has been raised a few times, so I won't labour the point, just mention my current thinking.

Unicode is complicated. Partly, that's because of a series of compromises in its design; but partly, it's because writing systems are complicated, and Unicode tries harder than most previous systems to acknowledge that. So, there's a tradeoff to be made between giving users what they think they need, thus hiding the messy details, and giving users the power to do things right, in a more complex way.

There is also a namespace mess if you insist on every function and property having to declare what level of abstraction it's talking about - e.g. $codePointLength instead of $length.
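
To show why a bare "length" is ambiguous, here is a small sketch that measures the same text at three levels. It uses PCRE's /u mode and the \X escape (extended grapheme cluster) rather than any UString API; the variable names are illustrative only.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, encoded as UTF-8.
$s = "e\xCC\x81";

$byteLength = strlen($s);                        // raw bytes
$codePointLength = preg_match_all('/./us', $s);  // Unicode code points
$graphemeLength = preg_match_all('/\X/u', $s);   // user-perceived characters

printf("%d bytes, %d code points, %d grapheme\n",
    $byteLength, $codePointLength, $graphemeLength);
// prints "3 bytes, 2 code points, 1 grapheme"
```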

An idea I've been toying with is rather than having one class representing the slippery notion of "a Unicode string", having (at least) two, closely tied, classes: CodePointString (roughly = UString right now) and GraphemeString (a higher level abstraction tied to the same internal representation).

I intend to mock this up as a set of interfaces at some point, but the basic idea is that you could write this:

// Get an abstract object from a byte string, probably a GraphemeString,
// parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right back to a concrete
// string of bytes
echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a CodePointString would be legal but a no-op, so it would be safe to accept both as input to a function, then switch to whichever level the task required.

I'm not sure if this finds a good balance between complexity and user-friendliness, and would welcome anyone's thoughts.
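
A very rough sketch of how the two views and the no-op conversions could fit together. Every name here (UnicodeText, AbstractText, CodePointText, GraphemeText, visibleLength) is hypothetical, the shared representation is assumed to be UTF-8, length is a method rather than a property, and normalise() is left out because it would need the intl extension:

```php
<?php
// Hypothetical names throughout; this is not the RFC's API.
interface UnicodeText
{
    public function asCodePoints();
    public function asGraphemes();
    public function length();
    public function reverse();
    public function asByteString($encoding = 'UTF-8');
}

abstract class AbstractText implements UnicodeText
{
    protected $utf8; // shared internal representation (assumed UTF-8)

    public function __construct($utf8)
    {
        $this->utf8 = $utf8;
    }

    // Split the text into the units this level of abstraction deals in.
    abstract protected function units();

    public function length()
    {
        return count($this->units());
    }

    public function reverse()
    {
        return new static(implode('', array_reverse($this->units())));
    }

    public function asByteString($encoding = 'UTF-8')
    {
        return mb_convert_encoding($this->utf8, $encoding, 'UTF-8');
    }
}

final class CodePointText extends AbstractText
{
    protected function units()
    {
        return preg_split('//u', $this->utf8, -1, PREG_SPLIT_NO_EMPTY);
    }

    public function asCodePoints() { return $this; } // no-op
    public function asGraphemes() { return new GraphemeText($this->utf8); }
}

final class GraphemeText extends AbstractText
{
    protected function units()
    {
        preg_match_all('/\X/u', $this->utf8, $matches);
        return $matches[0];
    }

    public function asCodePoints() { return new CodePointText($this->utf8); }
    public function asGraphemes() { return $this; } // no-op
}

// Accepts either level, then switches to the one the task needs.
function visibleLength(UnicodeText $text)
{
    return $text->asGraphemes()->length();
}

$str = new GraphemeText("ne\xCC\x81"); // "n", "e", U+0301 combining acute
echo visibleLength($str), "\n";
echo $str->asCodePoints()->length(), "\n";
```

In this sketch, reversing at the grapheme level keeps the accent attached to its "e", while reversing at the code point level moves the combining mark to the front of the string.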


--
Rowan Collins
[IMSoP]


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
