--- Jonadab the Unsightly One <[EMAIL PROTECTED]> wrote:
> Larry Wall <[EMAIL PROTECTED]> writes:
>
> > (I've been trying to make it assume some implicit unit based on the
> > current lexical scope's Unicode level, but issues remain.) We have
> > magical string positions that have different numeric values
> > depending on what units you view them as, but at what point does a
> > number like "5" get translated to such a magical string position?
>
> It would be possible to have right-associative operators (that bind
> at least more tightly than comma and possibly very tightly) and
> convert a number to one of these objects, so that we can do stuff
> like this:
>
>   substr($string, 2 bytes, 4 bytes) = $substitute;
>
> Then if you pass a plain number to substr it could either assume
> something (possibly generating a warning) or spit an error, depending
> on some feature of the current lexical scope.

A couple of alternatives:

  substr.bytes($string, 2, 4) = $substitute;

  substr($string.bytes, 2, 4) = $substitute;

  # Make it a pragma
  use String(bytes);
  substr($string, 2, 4) = $substitute;

  # Make it a global mode
  set_string_mode(bytes);
  substr($string, 2, 4) = $substitute;

  # Make it an object mode
  $string.access_mode(bytes);
  substr($string, 2, 4) = $substitute;

> The word "bytes" is clearly much too long, though, much less
> "graphemes" or "codepoints". I thought about this:
>
>   substr($string, 2b, 4b) = $substitute;

Problems with:

  substr($string, 0b, 1b) = $substitute;

Is that binary or bytes? Also:

  substr($string, $start b, $end b) = $substitute;

Looks unintuitive.

> With presumably g and c for graphemes and codepoints, but I rather
> suspect that might conflict with some other existing syntax (though I
> can't think of anything in particular).

0c? 0x16c? (Is that trailing c a hex digit or a codepoint suffix?)

> And I can't think of another abbreviation that would be remotely
> intuitive.
>
> There's also the possibility of bsubstr and so on, but that leads us
> down the path of C, having a hillion bajillion functions with names
> like fgets, stoi, and fstrnclost. Having sprintf is quite enough of
> that, IMO.
>
> > I dunno--it reads pretty well. Maybe these'll be heavily enough
> > used that we should Huffmanize them down a bit:
> >
> >   $str.bytes
> >   $str.codes
> >   $str.graphs
> >   $str.letters
>
> codes and graphs is better than codepoints and graphemes, at least.

In certain (IMO large) sectors of the Perl community, string processing
is just about all the work there is. I submit that there needs to be a
way to drive the token length to 0: either a pragma, or a global mode,
or a type definition.

> > Though "letters" is a bit inadequate to describe language-dependent
> > graphemes, since it also divides any non-letters...I suppose we
> > could go with .characters if we don't mind forcing a heavily
> > overloaded word in one particular direction, culturally speaking.
> > Except, I'd kinda like to keep them starting with different
> > letters.
> > (And maybe .chars should be reserved to mean whatever the default
> > unit is in the current lexical scope, as with substr() above.)
>
> You could coin the abbreviation ligs, for Language Independent
> Graphemes. Then some ingenious rascal can create a pragma or
> whatever that allows $str.b, $str.c, $str.g, and $str.l for
> fans of terseness.

As opposed to 'ligs' meaning ligatures? Fraught with peril. :-)

To me, the right thing to do is provide a 'default' way to work, and
allow for changing that default to some other way.

The obvious defaults are 'bytes', which gives C-like behavior (unpopular
though that may presently be) and imposes little or no conceptual strain
but likewise no enormous benefit, and 'graphemes'.

I like graphemes for the default because I hate and fear graphemes. The
whole *code thing just crawls right in my ear, so having the language
transparently support it would be a win. Having the language force me to
understand this stuff, if it cannot be transparently supported, would
also be a win, on a longer time scale.

=Austin
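
To make the stakes of the default concrete, here is a minimal sketch of
how the same (start, length) pair selects different text depending on
the unit. It is written in Python rather than Perl 6, since the Perl 6
syntax is exactly what this thread has not settled. The names
count_units and substr_by are hypothetical, UTF-8 storage is an
assumption, and the grapheme handling is only a rough approximation
(combining marks are folded into the preceding character), not the full
Unicode segmentation rules.

```python
import unicodedata


def _clusters(text):
    """Split text into rough grapheme clusters (approximation only)."""
    clusters = []
    for ch in text:
        # A nonzero combining class means "attach to the previous base".
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def count_units(text):
    """Measure one string in bytes, codepoints, and rough graphemes."""
    return {
        "bytes": len(text.encode("utf-8")),  # assumes UTF-8 storage
        "codes": len(text),                  # Python str indexes codepoints
        "graphs": len(_clusters(text)),
    }


def substr_by(text, start, length, unit):
    """Return a substring where start and length are counted in `unit`."""
    if unit == "bytes":
        raw = text.encode("utf-8")[start:start + length]
        # A byte slice can cut a character in half; show that as U+FFFD.
        return raw.decode("utf-8", errors="replace")
    if unit == "codes":
        return text[start:start + length]
    if unit == "graphs":
        return "".join(_clusters(text)[start:start + length])
    raise ValueError("unknown unit: " + unit)


# "resume" with each accent stored as a separate combining codepoint.
word = "re\u0301sume\u0301"
print(count_units(word))
for unit in ("bytes", "codes", "graphs"):
    print(unit, repr(substr_by(word, 1, 2, unit)))
```

With this input the three counts come out as 10 bytes, 8 codepoints, and
6 graphemes, and the byte-unit slice splits the accent in half. That is
the kind of surprise a scope-wide default would either hide or force the
programmer to confront.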