On 21/10/2014 08:06, Joe Watkins wrote:
> Morning internalz,
> https://wiki.php.net/rfc/ustring
> This is the result of work done by a few of us, we won't be opening any
> vote in a fortnight. We have a long time before 7, there is no rush
> whatever.
> Now seems like a good time to start the conversation so we can hash out
> the details, or get on with other things ;)
> Cheers
> Joe
I think this looks like a really great start at creating something
actually useful, rather than getting stuck at the drawing board. I like
that the scope is quite small initially - where does the "single
responsibility" of a class that represents a string end, anyway? :)
A few opinions:
1) Global / static defaults are bad.
The existence of the setDefaultCodepage method feels like an
anti-pattern to me. It means libraries can't rely on this class working
the same way in two different host environments, or even at two
re-entries in the same program. Effectively, if you don't know what the
second argument to the constructor will default to, you can't actually
treat it as optional unless you're writing monolithic code. This is a
common pattern in PHP, but http_build_query() would be so much more
pleasant if I could safely call it with 1 argument instead of 3 (its
separator default comes from the arg_separator.output ini setting).
I think the default should be hard-coded to UTF-8, which according to
previous discussion is always the default *output* encoding, so would
mean this would always work:
    $aUString = new UString( (string)$aUString );
Any other encoding will be dependent on, and known from, the context
where the object is created - if grabbing data from an HTTP request, a
header should tell you; if from a database, a connection parameter; and
so on.
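To make the hard-coded default concrete, here is a minimal sketch. Utf8Text is a hypothetical stand-in for UString, not the class from the RFC, and the conversion assumes the mbstring extension is loaded:

```php
<?php
// Hypothetical stand-in for the RFC's UString; names are illustrative only.
final class Utf8Text
{
    /** @var string Internal storage is always UTF-8 in this sketch. */
    private $utf8;

    // The default encoding is a fixed constant, not a mutable global setting,
    // so the second argument can safely be omitted by library code.
    public function __construct(string $bytes, string $encoding = 'UTF-8')
    {
        $this->utf8 = $encoding === 'UTF-8'
            ? $bytes
            : mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }

    // Casting to string always yields UTF-8, matching the constructor default.
    public function __toString(): string
    {
        return $this->utf8;
    }
}

// The encoding is known from context (say, an HTTP header) and passed explicitly.
$fromLatin1 = new Utf8Text("caf\xE9", 'ISO-8859-1');

// The round trip works with one argument, in any host environment.
$copy = new Utf8Text((string) $fromLatin1);
```

Because the default never changes, the one-argument call means the same thing everywhere it runs.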
The only case I can see where a default encoding would be sensible would
be where source code itself is in a different encoding, so that
u('literal string') works as expected. I guess if we ever went down the
route of special literal syntax like u'literal string', the declared
source encoding could be used.
Actually, the u() shortcut function appears to be missing the encoding
parameter completely; is this deliberate?
2) Clarify relationship to a "byte string"
Most of the API acts like this is an abstract object representing a
bunch of Unicode code points. As such, I'm not sure what getCodepage()
does - a code page (or more properly encoding) is a property of a stream
of bytes, so has no meaning in this context, surely? The internal
implementation could use UTF-8, UTF-16, or some made-up encoding (like
Perl6's "NFG" system) and the user should never need to know (other than
to understand performance implications).
On the other hand, when you *do* want a stream of bytes, the class
doesn't seem to have an explicit way to get one. The (currently
undocumented) behaviour is apparently to spit out UTF-8 if cast to a
string, but it would be nice to have an explicit function which could be
passed a parameter in order to serialise to, say, UTF-16, instead.
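A rough sketch of what such an explicit method could look like. ByteSerialisableText and toBytes() are made-up names, not part of the RFC, and mbstring is assumed. The usage below asks for 'UTF-16BE' rather than plain 'UTF-16' so that the byte order is unambiguous:

```php
<?php
// Hypothetical class; only the explicit serialisation method matters here.
final class ByteSerialisableText
{
    /** @var string Internal representation (UTF-8 in this sketch). */
    private $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    // Explicit serialisation: the caller names the target encoding.
    public function toBytes(string $encoding = 'UTF-8'): string
    {
        return mb_convert_encoding($this->utf8, $encoding, 'UTF-8');
    }

    // The implicit cast stays available and is just the UTF-8 case.
    public function __toString(): string
    {
        return $this->toBytes('UTF-8');
    }
}

$text    = new ByteSerialisableText("\u{E9}"); // U+00E9, e with acute accent
$utf8    = (string) $text;                     // two bytes in UTF-8
$utf16be = $text->toBytes('UTF-16BE');         // two bytes, big-endian
```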
3) The Grapheme Question
This has been raised a few times, so I won't labour the point, just
mention my current thinking.
Unicode is complicated. Partly, that's because of a series of
compromises in its design; but partly, it's because writing systems are
complicated, and Unicode tries harder than most previous systems to
acknowledge that. So, there's a tradeoff to be made between giving users
what they think they need, thus hiding the messy details, and giving
users the power to do things right, in a more complex way.
There is also a namespace mess if you insist on every function and
property having to declare what level of abstraction it's talking about
- e.g. $codePointLength instead of $length.
An idea I've been toying with is rather than having one class
representing the slippery notion of "a Unicode string", having (at
least) two, closely tied, classes: CodePointString (roughly = UString
right now) and GraphemeString (a higher level abstraction tied to the
same internal representation).
I intend to mock this up as a set of interfaces at some point, but the
basic idea is that you could write this:
// Get an abstract object from a byte string, probably a GraphemeString,
// parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right back to a concrete
// string of bytes
echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
Calling asGraphemes() on a GraphemeString or asCodePoints() on a
CodePointString would be legal but a no-op, so it would be safe to
accept both as input to a function, then switch to whichever level the
task required.
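As a very rough mock-up of the two-view idea, under several assumptions: every interface and class name below is hypothetical, length is a method rather than a property, normalise() and asByteString() are left out, and grapheme clusters are approximated with PCRE's \X rather than ICU:

```php
<?php
// Hypothetical interfaces; none of these names come from the RFC.
interface UnicodeText
{
    public function asCodePoints(): CodePointText;
    public function asGraphemes(): GraphemeText;
    public function length(): int;          // counted in the view's own unit
    public function reverse(): UnicodeText; // reverses the view's own units
    public function toUtf8(): string;
}
interface CodePointText extends UnicodeText {}
interface GraphemeText extends UnicodeText {}

// Both views share one internal representation (a UTF-8 string here).
abstract class BaseText implements UnicodeText
{
    /** @var string */
    protected $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    // Switching to the level you are already on returns the same object.
    public function asCodePoints(): CodePointText
    {
        return $this instanceof CodePointText ? $this : new CodePoints($this->utf8);
    }

    public function asGraphemes(): GraphemeText
    {
        return $this instanceof GraphemeText ? $this : new Graphemes($this->utf8);
    }

    public function toUtf8(): string
    {
        return $this->utf8;
    }

    public function length(): int
    {
        return count($this->units());
    }

    public function reverse(): UnicodeText
    {
        return new static(implode('', array_reverse($this->units())));
    }

    /** @return string[] the units this view deals in */
    abstract protected function units(): array;
}

final class CodePoints extends BaseText implements CodePointText
{
    protected function units(): array
    {
        preg_match_all('/./su', $this->utf8, $m); // one match per code point
        return $m[0];
    }
}

final class Graphemes extends BaseText implements GraphemeText
{
    protected function units(): array
    {
        preg_match_all('/\X/u', $this->utf8, $m); // one match per grapheme cluster
        return $m[0];
    }
}

// "e" followed by a combining acute accent, then "x".
$text     = new Graphemes("e\u{0301}x");
$cpLength = $text->asCodePoints()->length();
$grLength = $text->asGraphemes()->length();
$reversed = $text->asGraphemes()->reverse()->toUtf8();
```

Reversing at the grapheme level keeps the accent attached to its base letter, which is the kind of difference the two views are meant to make explicit.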
I'm not sure if this finds a good balance between complexity and
user-friendliness, and would welcome anyone's thoughts.
--
Rowan Collins
[IMSoP]