> On 21 Oct 2014, at 21:42, Rowan Collins wrote:
>
> The only case I can see where a default encoding would be sensible would be
> where source code itself is in a different encoding, so that u('literal
> string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise, if
one library uses Latin-1 and another uses UTF-8 for some reason, bang!

> 2) Clarify relationship to a "byte string"
>
> Most of the API acts like this is an abstract object representing a bunch of
> Unicode code points. As such, I'm not sure what getCodepage() does - a code
> page (or more properly encoding) is a property of a stream of bytes, so has
> no meaning in this context, surely? The internal implementation could use
> UTF-8, UTF-16, or some made-up encoding (like Perl6's "NFG" system) and the
> user should never need to know (other than to understand performance
> implications).
>
> On the other hand, when you *do* want a stream of bytes, the class doesn't
> seem to have an explicit way to get one. The (currently undocumented)
> behaviour is apparently to spit out UTF-8 if cast to a string, but it would
> be nice to have an explicit function which could be passed a parameter in
> order to serialise to, say, UTF-16, instead.

I agree on both these points. ->toBytes() or ->encode() with an explicit
charset parameter would be good. I don’t see the point of getCodepage().

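As a rough sketch of the kind of method I mean (the class name EncodableString
and the method name encode() are made up for illustration, this is not the
proposed UString API, and I am assuming UTF-8 internal storage and the iconv
extension):

```php
<?php
// Hypothetical sketch only: EncodableString and encode() are illustrative
// names, not part of the UString proposal. Assumes the iconv extension.
final class EncodableString
{
    /** @var string internal storage, assumed to be valid UTF-8 */
    private $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    // Serialise to a byte string in the charset the caller names explicitly.
    public function encode(string $charset): string
    {
        $bytes = iconv('UTF-8', $charset, $this->utf8);
        if ($bytes === false) {
            throw new InvalidArgumentException("Cannot encode as $charset");
        }
        return $bytes;
    }
}

$s = new EncodableString("h\xC3\xA9");       // "h" followed by U+00E9
echo bin2hex($s->encode('UTF-8')), "\n";     // 68c3a9
echo bin2hex($s->encode('UTF-16BE')), "\n";  // 006800e9
```

The point is that the caller always states the byte encoding, so the object
itself never needs to expose a "code page".
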
> 3) The Grapheme Question
>
> This has been raised a few times, so I won't labour the point, just mention
> my current thinking.
>
> Unicode is complicated. Partly, that's because of a series of compromises in
> its design; but partly, it's because writing systems are complicated, and
> Unicode tries harder than most previous systems to acknowledge that. So,
> there's a tradeoff to be made between giving users what they think they need,
> thus hiding the messy details, and giving users the power to do things right,
> in a more complex way.
>
> There is also a namespace mess if you insist on every function and property
> having to declare what level of abstraction it's talking about - e.g.
> $codePointLength instead of $length.
>
> An idea I've been toying with is rather than having one class representing
> the slippery notion of "a Unicode string", having (at least) two, closely
> tied, classes: CodePointString (roughly = UString right now) and
> GraphemeString (a higher level abstraction tied to the same internal
> representation).
>
> I intend to mock this up as a set of interfaces at some point, but the basic
> idea is that you could write this:
>
> // Get an abstract object from a byte string, probably a GraphemeString,
> // parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete
> // string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
>
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a
> CodePointString would be legal but a no-op, so it would be safe to accept
> both as input to a function, then switch to whichever level the task required.
>
> I'm not sure if this finds a good balance between complexity and
> user-friendliness, and would welcome anyone's thoughts.

I’d rather have some grapheme-specific functions and some code point functions
on the same class. Make array-like indexing with [] go by code points, since
that may be possible in constant time, and because there might be multiple
approaches to choosing graphemes. Have ->codepointAt(), but also
->nthGrapheme() or something like it. Not every function needs a grapheme
version, but some would.

Though your approach has its own merits.

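To illustrate the single-class idea, here is a minimal sketch. The class name
MixedLevelString is invented, the input is assumed to be valid UTF-8, and the
grapheme lookup leans on PCRE's \X, which is only one possible way of choosing
graphemes:

```php
<?php
// Hypothetical sketch only: MixedLevelString is an illustrative name, and
// nthGrapheme() is just one guess at how such a method could behave.
final class MixedLevelString implements ArrayAccess
{
    /** @var string[] one UTF-8 encoded code point per element */
    private $codePoints;

    public function __construct(string $utf8)
    {
        // Split into code points once, so [] becomes a plain array lookup.
        $this->codePoints = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    }

    public function offsetExists($offset): bool
    {
        return isset($this->codePoints[$offset]);
    }

    // Indexing with [] is by code point.
    public function offsetGet($offset): string
    {
        return $this->codePoints[$offset];
    }

    public function offsetSet($offset, $value): void
    {
        throw new LogicException('Immutable in this sketch');
    }

    public function offsetUnset($offset): void
    {
        throw new LogicException('Immutable in this sketch');
    }

    // Grapheme access is a separate, explicitly named method (linear scan).
    public function nthGrapheme(int $n): string
    {
        // \X matches one extended grapheme cluster in PCRE.
        preg_match_all('/\X/u', implode('', $this->codePoints), $m);
        return $m[0][$n];
    }
}

$s = new MixedLevelString("e\xCC\x81a");  // "e", U+0301 combining acute, "a"
echo bin2hex($s[1]), "\n";                // cc81 (the combining mark alone)
echo bin2hex($s->nthGrapheme(0)), "\n";   // 65cc81 ("e" plus the mark)
```

Here [] stays cheap and unambiguous, while the grapheme method is free to be
slower or to follow a different segmentation rule.
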
--
Andrea Faulds
http://ajf.me/