> On 21 Oct 2014, at 21:42, Rowan Collins <rowan.coll...@gmail.com> wrote:
> 
> The only case I can see where a default encoding would be sensible would be 
> where source code itself is in a different encoding, so that u('literal 
> string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise if one 
library uses Latin-1 and another uses UTF-8 for some reason, bang!
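
To illustrate the "bang": a rough sketch, assuming the mbstring extension is loaded, of what happens when bytes written in a Latin-1 source file meet a global UTF-8 default. The variable names are made up for the example; only mb_check_encoding() and mb_convert_encoding() are real functions.

```php
<?php
// The bytes of a string literal as they would sit in a Latin-1 source file.
// "café" in ISO-8859-1: é is the single byte 0xE9.
$latin1Literal = "caf\xE9";

// Under a process-wide "everything is UTF-8" default, these bytes are invalid.
$validAsUtf8 = mb_check_encoding($latin1Literal, 'UTF-8');

// With a per-file declaration, the same bytes could be decoded correctly.
$decoded = mb_convert_encoding($latin1Literal, 'UTF-8', 'ISO-8859-1');

var_dump($validAsUtf8);               // bool(false)
var_dump($decoded === "caf\xC3\xA9"); // bool(true)
```

So the default has to travel with the file that contains the literal, not with the process.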

> 2) Clarify relationship to a "byte string"
> 
> Most of the API acts like this is an abstract object representing a bunch of 
> Unicode code points. As such, I'm not sure what getCodepage() does - a code 
> page (or more properly encoding) is a property of a stream of bytes, so has 
> no meaning in this context, surely? The internal implementation could use 
> UTF-8, UTF-16, or some made-up encoding (like Perl6's "NFG" system) and the 
> user should never need to know (other than to understand performance 
> implications).
> 
> On the other hand, when you *do* want a stream of bytes, the class doesn't 
> seem to have an explicit way to get one. The (currently undocumented) 
> behaviour is apparently to spit out UTF-8 if cast to a string, but it would 
> be nice to have an explicit function which could be passed a parameter in 
> order to serialise to, say, UTF-16, instead.

I agree on both these points. ->toBytes or ->encode with an explicit charset 
parameter would be good. I don’t see the point of getCodepage().
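
Roughly what I have in mind, sketched with the existing mbstring extension rather than the proposed class. encodeTo() is a hypothetical stand-in for ->toBytes($charset) / ->encode($charset), and it assumes the text is held as UTF-8 internally, which the real class need not do.

```php
<?php
// Hypothetical stand-in for the proposed ->encode($charset):
// takes text held as UTF-8 and returns raw bytes in the requested encoding.
function encodeTo(string $utf8Text, string $charset): string
{
    return mb_convert_encoding($utf8Text, $charset, 'UTF-8');
}

$text = "h\xC3\xA9"; // "hé" as UTF-8 bytes

echo bin2hex(encodeTo($text, 'UTF-8')), "\n";      // 68c3a9
echo bin2hex(encodeTo($text, 'UTF-16BE')), "\n";   // 006800e9
echo bin2hex(encodeTo($text, 'ISO-8859-1')), "\n"; // 68e9
```

The caller always names the charset, so nothing depends on what the object happens to use internally.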

> 3) The Grapheme Question
> 
> This has been raised a few times, so I won't labour the point, just mention 
> my current thinking.
> 
> Unicode is complicated. Partly, that's because of a series of compromises in 
> its design; but partly, it's because writing systems are complicated, and 
> Unicode tries harder than most previous systems to acknowledge that. So, 
> there's a tradeoff to be made between giving users what they think they need, 
> thus hiding the messy details, and giving users the power to do things right, 
> in a more complex way.
> 
> There is also a namespace mess if you insist on every function and property 
> having to declare what level of abstraction it's talking about - e.g. 
> $codePointLength instead of $length.
> 
> An idea I've been toying with is rather than having one class representing 
> the slippery notion of "a Unicode string", having (at least) two, closely 
> tied, classes: CodePointString (roughly = UString right now) and 
> GraphemeString (a higher level abstraction tied to the same internal 
> representation).
> 
> I intend to mock this up as a set of interfaces at some point, but the basic 
> idea is that you could write this:
> 
> // Get an abstract object from a byte string, probably a GraphemeString,
> // parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete
> // string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
> 
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a 
> CodePointString would be legal but a no-op, so it would be safe to accept 
> both as input to a function, then switch to whichever level the task required.
> 
> I'm not sure if this finds a good balance between complexity and 
> user-friendliness, and would welcome anyone's thoughts.

I’d rather have some grapheme-specific functions and some code point functions 
on the same class. Make array-like indexing with [] be by code points as you 
may be able to do that in constant time, and because there might be multiple 
approaches to choosing graphemes. Have ->codepointAt(), but also 
->nthGrapheme() or something like it. Not every function needs a grapheme 
version, but some would.
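
A small sketch of the difference between the two kinds of indexing, using the existing mbstring and intl functions as stand-ins for the hypothetical ->codepointAt() and ->nthGrapheme(). It assumes both extensions are loaded and says nothing about how the proposed class would implement either.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x":
// three code points, but only two user-perceived characters (graphemes).
$s = "e\xCC\x81x";

var_dump(mb_strlen($s, 'UTF-8')); // int(3): counted in code points
var_dump(grapheme_strlen($s));    // int(2): counted in grapheme clusters

// Index 1 by code point is the bare combining accent...
$byCodePoint = mb_substr($s, 1, 1, 'UTF-8');
var_dump(bin2hex($byCodePoint));  // string(4) "cc81"

// ...while index 1 by grapheme is the "x".
$byGrapheme = grapheme_substr($s, 1, 1);
var_dump($byGrapheme);            // string(1) "x"
```

Both answers are legitimate; the method name just has to say which one you are getting.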

Though your approach has its own merits.
--
Andrea Faulds
http://ajf.me/





--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
