Re: The .bytes/.codepoints/.graphemes methods

Austin Hastings Mon, 28 Jun 2004 14:03:19 -0700

--- Jonadab the Unsightly One <[EMAIL PROTECTED]> wrote:
> Larry Wall <[EMAIL PROTECTED]> writes:
> 
> > (I've been trying to make it assume some implicit unit based on the
> > current lexical scope's Unicode level, but issues remain.)  We have
> > magical string positions that have different numeric values
> > depending on what units you view them as, but at what point does a
> > number like "5" get translated to such a magical string position?
> 
> It would be possible to have right-associative operators (that bind
> at least more tightly than comma and possibly very tightly) and
> convert a number to one of these objects, so that we can do stuff 
> like this:
> 
> substr($string, 2 bytes, 4 bytes) = $substitute;
> 
> Then if you pass a plain number to substr it could either assume
> something (possibly generating a warning) or spit an error, depending
> on some feature of the current lexical scope.


A couple of alternatives:

  substr.bytes($string, 2, 4) = $substitute;

  substr($string.bytes, 2, 4) = $substitute;

  # Make it a pragma
  use String(bytes);
  substr($string, 2, 4) = $substitute;

  # Make it a global mode
  set_string_mode(bytes);
  substr($string, 2, 4) = $substitute;

  # Make it an object mode
  $string.access_mode(bytes);
  substr($string, 2, 4) = $substitute;
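
To show why the unit matters at all, here is a minimal sketch in Python
(not Perl 6, since none of the syntax above exists yet). The names
substr, graphs, and the "unit" argument are my own, made up for
illustration, and the grapheme splitter is only an approximation (a base
code point plus any trailing combining marks), not full Unicode
segmentation:

```python
import unicodedata

def graphs(s):
    """Approximate graphemes: a base code point plus following combining marks.

    A simplification for illustration, not full Unicode text segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def substr(s, start, length, unit="codes"):
    """Hypothetical substr whose offsets are counted in the given unit."""
    if unit == "bytes":
        return s.encode("utf-8")[start:start + length]
    if unit == "codes":
        return s[start:start + length]
    if unit == "graphs":
        return "".join(graphs(s)[start:start + length])
    raise ValueError(f"unknown unit: {unit}")

# "resume" with each accented e written as e + U+0301 (combining acute)
s = "re\u0301sume\u0301"

print(len(s.encode("utf-8")), len(s), len(graphs(s)))   # 10 8 6
print(substr(s, 2, 4, "bytes"))    # starts in the middle of the first accent
print(substr(s, 2, 4, "codes"))    # starts with a bare combining mark
print(substr(s, 2, 4, "graphs"))   # whole characters only
```

The same (2, 4) names three different pieces of the string, and two of
them split a character from its accent.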

> The word "bytes" is clearly much too long, though, much less
> "graphemes" or "codepoints".  I thought about this:
> 
> substr($string, 2b, 4b) = $substitute;

Problems with:
 
  substr($string, 0b, 1b) = $substitute;

Is that binary or bytes? Also:

  substr($string, $start b, $end b) = $substitute;

Looks unintuitive.

> With presumably g and c for graphemes and codepoints, but I rather
> suspect that might conflict with some other existing syntax (though I
> can't think of anything in particular).

0c? 0x16c? (Is that the hex number 0x16c, or 0x16 codepoints?)

> And I can't think of another abbreviation that would be remotely
> intuitive.
> 
> There's also the possibility of bsubstr and so on, but that leads us
> down the path of C, having a hillion bajillion functions with names
> like fgets, stoi, and fstrnclost.  Having sprintf is quite enough of
> that, IMO.
> 
> > I dunno--it reads pretty well.  Maybe these'll be heavily enough
> > used that we should Huffmanize them down a bit:
> >
> >     $str.bytes
> >     $str.codes
> >     $str.graphs
> >     $str.letters
> 
> codes and graphs is better than codepoints and graphemes, at least.

In certain (IMO large) sectors of the Perl community, string processing
is just about all the work there is. I submit that there needs to be a
way to drive the token length to 0: either a pragma, or a global mode,
or a type definition.

> 
> > Though "letters" is a bit inadequate to describe language-dependent
> > graphemes, since it also divides any non-letters...I suppose we
> > could go with .characters if we don't mind forcing a heavily
> > overloaded word in one particular direction, culturally speaking.
> > Except, I'd kinda like to keep them starting with different
> > letters.
> > (And maybe .chars should be reserved to mean whatever the default
> > unit is in the current lexical scope, as with substr() above.)
>   
> You could coin the abbreviation ligs, for Language Independent
> Graphemes.  Then some ingenious rascal can create a pragma or
> whatever that allows $str.b, $str.c, $str.g, and $str.l for 
> fans of terseness.

As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

To me, the right thing to do is provide a 'default' way to work, and
allow for changing that default to some other way. The obvious defaults
are 'bytes', which gives C-like behavior (unpopular though that may
presently be) and imposes little or no conceptual strain but likewise
no enormous benefit, and 'graphemes'.

I like graphemes for the default because I hate and fear graphemes. The
whole *code thing just crawls right in my ear, so having the language
transparently support it would be a win. Having the language force me
to understand this stuff, if it cannot be transparently supported,
would also be a win, on a longer time scale.

=Austin

Re: The .bytes/.codepoints/.graphemes methods