On 21/10/2014 08:06, Joe Watkins wrote:
> Morning internalz,
>
>         https://wiki.php.net/rfc/ustring
>
>         This is the result of work done by a few of us, we won't be opening any
> vote in a fortnight. We have a long time before 7, there is no rush
> whatever.
>
>         Now seems like a good time to start the conversation so we can hash out
> the details, or get on with other things ;)
>
> Cheers
> Joe



I think this looks like a really great start at creating something actually useful, rather than getting stuck at the drawing board. I like that the scope is quite small initially - where does the "single responsibility" of a class that represents a string end, anyway? :)

A few opinions:

1) Global / static defaults are bad.

The existence of the setDefaultCodepage method feels like an anti-pattern to me. It means libraries can't rely on this class working the same way in two different host environments, or even at two different points in the same program. Effectively, if you don't know what the second argument to the constructor will default to, you can't treat it as optional unless you're writing monolithic code. This is a common pattern in PHP, but that doesn't make it a good one: http_build_query() would be so much more pleasant if I could safely call it with 1 argument instead of 3 (its separator defaults to the arg_separator.output ini setting, so portable code has to pass it explicitly).
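
To make the problem concrete, here is a minimal sketch. SketchString, setDefaultEncoding() and library_wrap() are made-up names standing in for the RFC's class and for some third-party library; only the defaulting logic is modelled, and no conversion actually happens.

```php
<?php
// Hypothetical stand-in for the proposed class: it only records which
// encoding the constructor resolved, it does not convert anything.
final class SketchString
{
    private static string $defaultEncoding = 'UTF-8';

    public string $bytes;
    public string $encoding;

    public function __construct(string $bytes, ?string $encoding = null)
    {
        $this->bytes = $bytes;
        // The "optional" argument resolves to whatever was set last, anywhere.
        $this->encoding = $encoding ?? self::$defaultEncoding;
    }

    public static function setDefaultEncoding(string $encoding): void
    {
        self::$defaultEncoding = $encoding;
    }
}

// Hypothetical library code that trusts the default.
function library_wrap(string $bytes): SketchString
{
    return new SketchString($bytes);
}

$first = library_wrap("caf\xC3\xA9");           // read as UTF-8

// Somewhere else in the host application...
SketchString::setDefaultEncoding('ISO-8859-1');

$second = library_wrap("caf\xC3\xA9");          // same call, same bytes

var_dump($first->encoding === $second->encoding); // bool(false)
```

The library cannot defend against this except by always passing the second argument, at which point it is no longer optional in any useful sense.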

I think the default should be hard-coded to UTF-8. According to previous discussion, that is always the default *output* encoding, so this would always work: $aUString = new UString( (string)$aUString ); Any other encoding will be dependent on, and known from, the context where the object is created: if grabbing data from an HTTP request, a header should tell you; if from a database, a connection parameter; and so on.

The only case I can see where a default encoding would be sensible would be where source code itself is in a different encoding, so that u('literal string') works as expected. I guess if we ever went down the route of special literal syntax like u'literal string', the declared source encoding could be used.

Actually, the u() shortcut function appears to be missing the encoding parameter completely; is this deliberate?

2) Clarify relationship to a "byte string"

Most of the API acts like this is an abstract object representing a bunch of Unicode code points. As such, I'm not sure what getCodepage() does - a code page (or more properly encoding) is a property of a stream of bytes, so has no meaning in this context, surely? The internal implementation could use UTF-8, UTF-16, or some made-up encoding (like Perl6's "NFG" system) and the user should never need to know (other than to understand performance implications).

On the other hand, when you *do* want a stream of bytes, the class doesn't seem to have an explicit way to get one. The (currently undocumented) behaviour is apparently to spit out UTF-8 if cast to a string, but it would be nice to have an explicit function which could be passed a parameter in order to serialise to, say, UTF-16, instead.
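
As a rough illustration of what I mean by an explicit serialisation function, here is a sketch. SketchText and toBytes() are hypothetical names, not anything in the RFC; the internal form happens to be UTF-8 here, and I have only hand-rolled UTF-16BE as a second target to keep it free of extensions.

```php
<?php
// Hypothetical sketch: the internal representation is private, and bytes
// only come out when the caller asks for them and names an encoding.
final class SketchText
{
    private string $utf8;

    public function __construct(string $utf8Bytes)
    {
        // With the /u modifier, preg_match() fails on malformed UTF-8.
        if (preg_match('//u', $utf8Bytes) !== 1) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->utf8 = $utf8Bytes;
    }

    // Casting to string is just shorthand for the documented default.
    public function __toString(): string
    {
        return $this->toBytes('UTF-8');
    }

    public function toBytes(string $encoding): string
    {
        switch ($encoding) {
            case 'UTF-8':
                return $this->utf8;
            case 'UTF-16BE':
                return self::utf8ToUtf16be($this->utf8);
            default:
                throw new InvalidArgumentException("Unsupported encoding: $encoding");
        }
    }

    private static function utf8ToUtf16be(string $utf8): string
    {
        $out = '';
        foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $b = array_values(unpack('C*', $char));
            // Decode one UTF-8 sequence (1 to 4 bytes) into a code point.
            switch (count($b)) {
                case 1:
                    $cp = $b[0];
                    break;
                case 2:
                    $cp = (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
                    break;
                case 3:
                    $cp = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
                    break;
                default:
                    $cp = (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                        | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
            }
            if ($cp < 0x10000) {
                $out .= pack('n', $cp);
            } else {
                // Code points above the BMP become a surrogate pair.
                $cp -= 0x10000;
                $out .= pack('nn', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
            }
        }
        return $out;
    }
}

$text = new SketchText("h\xC3\xA9");              // "hé" as UTF-8 bytes
echo bin2hex((string) $text), "\n";               // 68c3a9
echo bin2hex($text->toBytes('UTF-16BE')), "\n";   // 006800e9
```

A real implementation would presumably delegate to ICU or similar rather than do this by hand; the point is only that the encoding is named at the moment bytes leave the object.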

3) The Grapheme Question

This has been raised a few times, so I won't labour the point, just mention my current thinking.

Unicode is complicated. Partly, that's because of a series of compromises in its design; but partly, it's because writing systems are complicated, and Unicode tries harder than most previous systems to acknowledge that. So, there's a tradeoff to be made between giving users what they think they need, thus hiding the messy details, and giving users the power to do things right, in a more complex way.

There is also a namespace mess if you insist on every function and property having to declare what level of abstraction it's talking about - e.g. $codePointLength instead of $length.

An idea I've been toying with is, rather than having one class representing the slippery notion of "a Unicode string", having (at least) two closely tied classes: CodePointString (roughly what UString is right now) and GraphemeString (a higher level abstraction tied to the same internal representation).

I intend to mock this up as a set of interfaces at some point, but the basic idea is that you could write this:

// Get an abstract object from a byte string, probably a GraphemeString, parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right back to a concrete string of bytes
echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a CodePointString would be legal but a no-op, so it would be safe to accept both as input to a function, then switch to whichever level the task required.
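
A very rough first cut of those interfaces might look like the following. Everything here is hypothetical: the UnicodeText interface name is mine, length() is a method rather than the property used above, and the counting leans on PCRE (the /u modifier and \X for extended grapheme clusters) purely so the sketch runs without extra extensions.

```php
<?php
// Hypothetical common interface: both views can hand you either view.
interface UnicodeText
{
    public function asCodePoints(): CodePointString;
    public function asGraphemes(): GraphemeString;
    public function length(): int;
}

final class CodePointString implements UnicodeText
{
    private string $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function asCodePoints(): CodePointString
    {
        return $this; // already at this level: legal, but a no-op
    }

    public function asGraphemes(): GraphemeString
    {
        return new GraphemeString($this->utf8);
    }

    public function length(): int
    {
        // Splitting on the empty pattern with /u yields one item per code point.
        return count(preg_split('//u', $this->utf8, -1, PREG_SPLIT_NO_EMPTY));
    }
}

final class GraphemeString implements UnicodeText
{
    private string $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    public function asCodePoints(): CodePointString
    {
        return new CodePointString($this->utf8);
    }

    public function asGraphemes(): GraphemeString
    {
        return $this; // already at this level: legal, but a no-op
    }

    public function length(): int
    {
        // \X matches one extended grapheme cluster.
        return preg_match_all('/\X/u', $this->utf8);
    }
}

// A function can accept either view and switch to the level it needs.
function describe(UnicodeText $text): string
{
    return sprintf(
        'code points: %d, graphemes: %d',
        $text->asCodePoints()->length(),
        $text->asGraphemes()->length()
    );
}

$text = new GraphemeString("e\u{0301}");   // "e" followed by a combining acute accent
echo describe($text), "\n";                // code points: 2, graphemes: 1
```

Both classes wrap the same bytes, so switching views is cheap; only the meaning of "length", "reverse" and friends changes.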

I'm not sure if this finds a good balance between complexity and user-friendliness, and would welcome anyone's thoughts.

--
Rowan Collins
[IMSoP]


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
