Re: The .bytes/.codepoints/.graphemes methods

Larry Wall Sat, 26 Jun 2004 13:21:26 -0700

On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote:
: As currently designed, the String::bytes, String::codepoints, and 
: String::graphemes methods return the number of bytes, codepoints, 
: and graphemes, respectively, in the string they were called on.  I 
: would like to suggest that, when called in list context, these 
: methods return an array of strings split by bytes, codepoints, and 
: graphemes, respectively.
: 
: This would make it unambiguous whether certain string operations 
: referred to bytes, codepoints, or graphemes:
: 
:     $str.bytes[0].ord
:     $str.codepoints[0..4].join        #substr
: 
: As well as allowing some operations that are currently much more 
: difficult:
: 
:     $str.bytes[3].ord
:     $str.graphemes[144].lc
: 
: Issues:
:   * Limits lvalue substr (doesn't allow it to be a different size)
:     unless splice is used (or a substr method is also provided).


That all has to be looked at anyway.  What does "5" mean when you
pass it to substr, anyway?  (I've been trying to make it assume some
implicit unit based on the current lexical scope's Unicode level,
but issues remain.)  We have magical string positions that have
different numeric values depending on what units you view them as,
but at what point does a number like "5" get translated to such
a magical string position?
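
To make the unit problem concrete, here is a minimal sketch in Python
rather than Perl 6, since none of the Perl 6 methods discussed here are
settled. The function names and the sample string are assumptions made
up for illustration, and the grapheme rule used (a base character plus
any combining marks that follow it) is a deliberate simplification of
full Unicode grapheme clustering. It shows one string position taking
three different numeric values depending on the unit.

```python
import unicodedata

def graphemes(s):
    """Approximate grapheme split: a base char plus following combining marks."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the combining mark to the previous cluster
        else:
            out.append(ch)     # start a new cluster
    return out

def position_in_units(s, grapheme_index):
    """Express the start of the Nth grapheme in graphemes, codepoints, and UTF-8 bytes."""
    prefix = "".join(graphemes(s)[:grapheme_index])
    return {
        "graphemes": grapheme_index,
        "codepoints": len(prefix),
        "bytes": len(prefix.encode("utf-8")),
    }

s = "e\u0301te\u0301"   # "ete" with combining acute accents on both e's
pos = position_in_units(s, 2)
```

The same spot in the string (the start of the third grapheme) is 2, 3,
or 4 depending on the unit, which is why a bare number handed to substr
is ambiguous until something fixes the unit.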

:   * Memory consumption.

Not necessarily, if the method merely returns a "view" of the string
without actually doing the split.
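
One possible shape for such a view, again sketched in Python under
stated assumptions: the class name GraphemeView is hypothetical, the
grapheme rule is the simplified "base character plus combining marks"
approximation, and negative indexes are not supported. The view keeps
only a reference to the original string and finds boundaries on demand
instead of building a list of substrings.

```python
import unicodedata

class GraphemeView:
    """Sequence-like view over a string; nothing is split until it is asked for."""

    def __init__(self, s):
        self._s = s  # reference to the original string, no copy is made

    def __iter__(self):
        start = 0
        for i in range(1, len(self._s)):
            # A non-combining character begins a new grapheme.
            if not unicodedata.combining(self._s[i]):
                yield self._s[start:i]
                start = i
        if self._s:
            yield self._s[start:]

    def __len__(self):
        # Counts boundaries without storing the pieces.
        return sum(1 for _ in self)

    def __getitem__(self, index):
        for i, g in enumerate(self):
            if i == index:
                return g
        raise IndexError(index)

view = GraphemeView("e\u0301te\u0301")
first = view[0]
count = len(view)
```

This trades time for memory: each lookup rescans from the start of the
string, so a real implementation would probably cache boundary offsets.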

:   * A bit odd-looking.

I dunno--it reads pretty well.  Maybe these'll be heavily enough
used that we should Huffmanize them down a bit:

    $str.bytes
    $str.codes
    $str.graphs
    $str.letters

Though "letters" is a bit inadequate to describe language-dependent
graphemes, since it also divides any non-letters...I suppose we
could go with .characters if we don't mind forcing a heavily
overloaded word in one particular direction, culturally speaking.
Except, I'd kinda like to keep them starting with different letters.
(And maybe .chars should be reserved to mean whatever the default
unit is in the current lexical scope, as with substr() above.)

: Benefits:
:   * Removes ambiguity in an area that needs said ambiguity removed.
:   * Allows us to reuse constructs (e.g. slicing).
:   * Opens up a few previously-difficult constructs (like getting the
:     ord() of an arbitrary character).

I'd also point out that the scalar definitions fall out of it
naturally.

One other downside is that you might have to insert + in various
places to get the numeric interpretation.  But that could be
construed as self-documentation.

Larry
