At 03:17 PM 6/20/2001 +0200, Bart Lateur wrote:
>On Tue, 19 Jun 2001 11:53:28 -0700, Hong Zhang wrote:
>
> >> * Do a substr operation by character and glyph
> >
> >The byte based is more useful. I have utf-8, and I want to substr it
> >to another utf-8. It is painful to convert it or linear search for
> >character position.
>
>I tend to agree.
>
>I currently use substr(), length() and read()/sysread(), based on a byte
>count. It's a mindset. Even if my encoding is in (16 bit) Unicode or
>UTF8, I still prefer to use bytes as my count base.

Sure, but that's at the language level. That's not where we're at.

>Personally, I would prefer if it stayed this way, i.e. that the raw,
>non-OO keywords for the above kept counting in bytes.

That one's Larry's call, and it's a language level thing anyway. The
internals should give you access to lengths by byte and character at least,
if not byte, character, and glyph.

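To make the three units concrete, here is a minimal sketch in plain Perl 5.
It is not a proposed internals API: the helper name lengths() is made up for
illustration, and it assumes a Perl recent enough to ship the Encode module
and to support \X (extended grapheme cluster) in regexes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: report the length of a decoded string in three units.
sub lengths {
    my ($str) = @_;
    my $bytes  = length(encode('UTF-8', $str));  # size once serialized as UTF-8
    my $chars  = length($str);                   # Unicode code points
    my $glyphs = () = $str =~ /\X/g;             # grapheme clusters, counted via \X
    return ($bytes, $chars, $glyphs);
}

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a"
my ($bytes, $chars, $glyphs) = lengths("e\x{301}a");
print "bytes=$bytes chars=$chars glyphs=$glyphs\n";
```

For that sample string the three counts all differ, which is why the unit has
to be chosen explicitly somewhere, whether by the language or by the caller.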
>Why? Just imagine processing a binary file like a JPEG file, with
>embedded comments in (16-bit) Unicode. You wouldn't want Perl preventing
>you from treating this comment as Unicode, or having to process this
>entire binary file as Unicode, would you? I'd hate that. I want to
>remain in control.

Of course, but in that case you don't have UTF-8/16/32 data--you have binary
data. The scalar with the info shouldn't be tagged as anything but binary.

>I would not mind if OO versions of these words were smarter, and did
>their count in characters for whatever character mode they're set to.
>For example, if $string is a UTF8 object, then $string->length may
>return a length in (UTF8) characters.

Hassle Damian about it--I expect he's got a proposal for this already.
(Granted you might have to program in Klingon to get it... :)

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk