> On 21 Oct 2014, at 21:42, Rowan Collins <rowan.coll...@gmail.com> wrote:
>
> The only case I can see where a default encoding would be sensible would be
> where source code itself is in a different encoding, so that u('literal
> string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise if one library uses Latin-1 and another uses UTF-8 for some reason, bang!

> 2) Clarify relationship to a "byte string"
>
> Most of the API acts like this is an abstract object representing a bunch of
> Unicode code points. As such, I'm not sure what getCodepage() does - a code
> page (or more properly encoding) is a property of a stream of bytes, so has
> no meaning in this context, surely? The internal implementation could use
> UTF-8, UTF-16, or some made-up encoding (like Perl6's "NFG" system) and the
> user should never need to know (other than to understand performance
> implications).
>
> On the other hand, when you *do* want a stream of bytes, the class doesn't
> seem to have an explicit way to get one. The (currently undocumented)
> behaviour is apparently to spit out UTF-8 if cast to a string, but it would
> be nice to have an explicit function which could be passed a parameter in
> order to serialise to, say, UTF-16, instead.

I agree on both these points. ->toBytes or ->encode with an explicit charset parameter would be good. I don’t see the point of getCodepage().

> 3) The Grapheme Question
>
> This has been raised a few times, so I won't labour the point, just mention
> my current thinking.
>
> Unicode is complicated. Partly, that's because of a series of compromises in
> its design; but partly, it's because writing systems are complicated, and
> Unicode tries harder than most previous systems to acknowledge that. So,
> there's a tradeoff to be made between giving users what they think they need,
> thus hiding the messy details, and giving users the power to do things right,
> in a more complex way.
>
> There is also a namespace mess if you insist on every function and property
> having to declare what level of abstraction it's talking about - e.g.
> $codePointLength instead of $length.
>
> An idea I've been toying with is rather than having one class representing
> the slippery notion of "a Unicode string", having (at least) two, closely
> tied, classes: CodePointString (roughly = UString right now) and
> GraphemeString (a higher level abstraction tied to the same internal
> representation).
>
> I intend to mock this up as a set of interfaces at some point, but the basic
> idea is that you could write this:
>
> // Get an abstract object from a byte string, probably a GraphemeString, parsing the input as UTF-8
> $str = u('some text');
> // Perform an operation that explicitly deals in Code Points
> $str = $str->asCodePoints()->normalise('NFC');
> // Get information using a higher level of abstraction
> $length = $str->asGraphemes()->length;
> // Perform a high-level mutation, then convert right back to a concrete string of bytes
> echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
>
> Calling asGraphemes() on a GraphemeString or asCodePoints() on a
> CodePointString would be legal but a no-op, so it would be safe to accept
> both as input to a function, then switch to whichever level the task required.
>
> I'm not sure if this finds a good balance between complexity and
> user-friendliness, and would welcome anyone's thoughts.

I’d rather have some grapheme-specific functions and some code point functions on the same class. Make array-like indexing with [] be by code points, as you may be able to do that in constant time, and because there might be multiple approaches to choosing graphemes. Have ->codepointAt(), but also ->nthGrapheme() or something like it. Not every function needs a grapheme version, but some would. Though your approach has its own merits.

--
Andrea Faulds
http://ajf.me/

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
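To make the three levels discussed above concrete (bytes, code points, graphemes), plus explicit encoding on output, here is a minimal sketch using functions that already exist in PHP's mbstring and intl extensions. It is not the proposed UString API: none of getCodepage(), ->toBytes, ->encode, ->codepointAt() or ->nthGrapheme() appear in it, and the variable names are purely illustrative. It assumes PHP 7.0 or later (for the "\u{...}" escape) with both extensions loaded.

```php
<?php
// Sketch only: existing mbstring/intl functions, not the proposed UString class.

// "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme, two code points.
$text = "e\u{0301}";

$byteLength      = strlen($text);              // 3: bytes in the UTF-8 encoding
$codePointLength = mb_strlen($text, 'UTF-8');  // 2: code points
$graphemeLength  = grapheme_strlen($text);     // 1: grapheme clusters

// Indexing by code point versus by grapheme gives different "first characters".
$firstCodePoint = mb_substr($text, 0, 1, 'UTF-8'); // "e" without the accent
$firstGrapheme  = grapheme_substr($text, 0, 1);    // "e" plus the combining accent

// A code-point-level operation: NFC composes the pair into the single U+00E9.
$nfc = Normalizer::normalize($text, Normalizer::FORM_C);

// Explicit serialisation to a chosen encoding, instead of relying on a cast.
$utf16 = mb_convert_encoding($nfc, 'UTF-16BE', 'UTF-8');

printf("bytes=%d codepoints=%d graphemes=%d\n",
    $byteLength, $codePointLength, $graphemeLength);
printf("NFC codepoints=%d UTF-16BE=%s\n",
    mb_strlen($nfc, 'UTF-8'), bin2hex($utf16));
```

The same string has three different lengths depending on the level of abstraction, which is why the name of a plain "length" property is contentious, and why an explicit charset parameter on output is less surprising than an implicit UTF-8 cast.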