Re: The .bytes/.codepoints/.graphemes methods

Jonadab the Unsightly One Mon, 28 Jun 2004 08:28:50 -0700

Larry Wall <[EMAIL PROTECTED]> writes:

> That all has to be looked at anyway.  What does "5" mean when you
> pass it to substr, anyway?


I was just going to ask about substrings, and then didn't because I
figured that had been hashed out already and I'd missed it...

> (I've been trying to make it assume some implicit unit based on the
> current lexical scope's Unicode level, but issues remain.)  We have
> magical string positions that have different numeric values
> depending on what units you view them as, but at what point does a
> number like "5" get translated to such a magical string position?
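To make the ambiguity concrete, here is a rough sketch in Python (not Perl 6, whose syntax is still being worked out in this thread) of how the same "5" selects three different prefixes of one string. The names graphemes() and prefix() are made up for illustration, and the grapheme logic is only an approximation (a base character plus any trailing combining marks), not the full Unicode segmentation rules.

```python
import unicodedata

def graphemes(s):
    """Approximate grapheme clusters: a base character plus the
    combining marks that follow it (an assumption, not full UAX #29)."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def prefix(s, n, unit):
    """Return the first n units of s, where unit names what n counts."""
    if unit == "bytes":
        return s.encode("utf-8")[:n]        # may cut a character in half
    if unit == "codes":
        return s[:n]                        # Python str indexes by codepoint
    if unit == "graphs":
        return "".join(graphemes(s)[:n])
    raise ValueError("unknown unit: " + unit)

# "resumes" with both accents written as combining marks (U+0301)
s = "re\u0301sume\u0301s"

for unit in ("bytes", "codes", "graphs"):
    print(unit, repr(prefix(s, 5, unit)))
```

The three results differ in length and content, which is exactly why a bare number needs a unit from somewhere before substr can act on it.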

It would be possible to have right-associative operators (binding at
least more tightly than comma, and possibly very tightly) that convert
a number to one of these objects, so that we can do stuff like this:

substr($string, 2 bytes, 4 bytes) = $substitute;

Then if you pass a plain number to substr it could either assume
something (possibly generating a warning) or spit an error, depending
on some feature of the current lexical scope.
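As a rough illustration of that idea, here is a Python sketch of unit-tagged offsets. Everything in it is hypothetical: the classes Bytes and Codes stand in for "2 bytes" and "2 codes", the strict flag stands in for whatever the lexical scope would control, and the fallback to codepoints for plain numbers is just one possible assumption. It only reads a substring; it does not attempt the lvalue assignment shown above.

```python
import warnings

class Offset(int):
    """An integer that remembers which unit it counts."""

class Bytes(Offset):
    pass

class Codes(Offset):
    pass

def as_offset(n, strict):
    """Pass tagged offsets through; warn or fail on plain numbers."""
    if isinstance(n, Offset):
        return n
    if strict:
        raise TypeError("substr needs an explicit unit in this scope")
    warnings.warn("plain number passed to substr; assuming codepoints")
    return Codes(n)

def substr(s, start, length, strict=False):
    start = as_offset(start, strict)
    length = as_offset(length, strict)
    if type(start) is not type(length):
        raise TypeError("mixed units are not handled in this sketch")
    if isinstance(start, Bytes):
        # Byte offsets index into the UTF-8 encoding and return bytes.
        return s.encode("utf-8")[start:start + length]
    # Codepoint offsets index the string directly.
    return s[start:start + length]

s = "na\u00efve"   # the i-diaeresis takes two bytes in UTF-8

print(substr(s, Bytes(2), Bytes(4)))
print(substr(s, Codes(2), Codes(3)))
```

With strict=True a call such as substr(s, 2, 3, strict=True) raises instead of guessing, which corresponds to the "spit an error" behaviour.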

The word "bytes" is clearly much too long, though, and "graphemes" and
"codepoints" are worse still.  I thought about this:

substr($string, 2b, 4b) = $substitute;

With presumably g and c for graphemes and codepoints, but I rather
suspect that might conflict with some other existing syntax (though I
can't think of anything in particular).

And I can't think of another abbreviation that would be remotely
intuitive.

There's also the possibility of bsubstr and so on, but that leads us
down the path of C, having a hillion bajillion functions with names
like fgets, stoi, and fstrnclost.  Having sprintf is quite enough of
that, IMO.

> I dunno--it reads pretty well.  Maybe these'll be heavily enough
> used that we should Huffmanize them down a bit:
>
>     $str.bytes
>     $str.codes
>     $str.graphs
>     $str.letters

codes and graphs are better than codepoints and graphemes, at least.

> Though "letters" is a bit inadequate to describe language-dependent
> graphemes, since it also divides any non-letters...I suppose we
> could go with .characters if we don't mind forcing a heavily
> overloaded word in one particular direction, culturally speaking.
> Except, I'd kinda like to keep them starting with different letters.
> (And maybe .chars should be reserved to mean whatever the default
> unit is in the current lexical scope, as with substr() above.)
  
You could coin the abbreviation ligs, for Language Independent
Graphemes.  Then some ingenious rascal can create a pragma or whatever
that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness.
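For what it's worth, here is a small Python sketch of that arrangement: long-named accessors by default, with the one-letter aliases bolted on by a separate opt-in step. The class UStr and the function use_terse() are invented for this example, the accessors are assumed to return counts, and the grapheme count is approximated by counting non-combining characters. The language-dependent level (.letters or .l) is left out because it would need locale data.

```python
import unicodedata

class UStr(str):
    """Hypothetical string type exposing its length in several units."""

    @property
    def bytes(self):
        return len(self.encode("utf-8"))

    @property
    def codes(self):
        return len(self)

    @property
    def graphs(self):
        # Approximation: every non-combining character starts a grapheme.
        return sum(1 for ch in self if not unicodedata.combining(ch))

def use_terse(cls):
    """Stand-in for the pragma: add .b, .c and .g as aliases."""
    for short, long_name in (("b", "bytes"), ("c", "codes"), ("g", "graphs")):
        setattr(cls, short, getattr(cls, long_name))

t = UStr("cafe\u0301")          # accent written as a combining mark
print(t.bytes, t.codes, t.graphs)

use_terse(UStr)                 # opt in to the short names
print(t.b, t.c, t.g)
```

Keeping the short names behind an explicit opt-in leaves the default namespace readable while still serving fans of terseness.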

-- 
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"[EMAIL PROTECTED]/ --";$\=$ ;-> ();print$/
