Utf8 and length()

Aaron Craig Fri, 15 Jun 2001 03:34:36 -0700
Using Utf8:

use Utf8;

If I have some text:

my $text = "abcdefghić"; # note the ć

and I convert it to a fake Unicode string (fake, because you couldn't do 
this with text that needs real 16-bit characters, such as Chinese) by 
sticking a null after each letter:

# Pads each byte with a trailing null, giving little-endian 16-bit units.
sub AnsiToUnicode($)
   {
   my ($sAnsi) = @_;
   my $lLength = length($sAnsi);
   my $sUnicode = "";
   for(my $i = 0; $i < $lLength; $i++)
     {
     $sUnicode .= substr $sAnsi, $i, 1;
     $sUnicode .= "\0";
     }
   return $sUnicode;
   }

my $uni_text = AnsiToUnicode($text);
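
For illustration only, the same null-padding can be sketched with pack and unpack. The helper name bytes_to_utf16le is made up for this example, and it assumes every input character fits in a single byte, which is exactly why the result is only "fake" Unicode:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): widen each byte
# to a 16-bit little-endian unit, i.e. append a null after every byte.
sub bytes_to_utf16le {
    my ($bytes) = @_;
    return pack("v*", unpack("C*", $bytes));
}

my $wide = bytes_to_utf16le("abc\xE6");

# Dump the result as hex so the inserted nulls are visible.
printf "%d bytes: %s\n", length($wide),
    join(" ", map { sprintf "%02X", $_ } unpack("C*", $wide));
# prints "8 bytes: 61 00 62 00 63 00 E6 00"
```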

Then I convert it to UTF-8. (I know: why not go straight from ANSI to 
UTF-8? It's a long story. Basically, we work in Unicode here, so I've 
never had to go from ANSI to UTF-8, but for the sake of the question, let's 
do it this way.)

# Converts a string of 16-bit units to UTF-8; $bIsBigEndian selects byte order.
sub UnicodeToUtf8($$)
        {
        my($bIsBigEndian, $sText) = @_;
        my $sReturn = "";
        my $lLength = length($sText);
        for(my $i = 0; $i < $lLength; $i += 2)
                {
                my $sChar = substr($sText, $i, 2);
                my $lByte1;
                my $lByte2;
                if($bIsBigEndian == 0)
                        {
                        $lByte1 = ord(substr($sChar, 1, 1));
                        $lByte2 = ord(substr($sChar, 0, 1));
                        }
                else
                        {
                        $lByte1 = ord(substr($sChar, 0, 1));
                        $lByte2 = ord(substr($sChar, 1, 1));
                        }
                my $lUni = ($lByte1 * 0x100) + $lByte2;
                if ($lUni < 0x80)
                        {
                        $sReturn .=  chr($lUni);
                        }
                elsif ($lUni < 0x800)
                        {
                        $sReturn .= chr(0xc0 | $lUni >> 6);
                        $sReturn .= chr(0x80 | $lUni & 0x3f);
                        }
                elsif ($lUni < 0x10000)
                        {
                        $sReturn .= chr(0xe0 | $lUni >> 12);
                        $sReturn .= chr(0x80 | $lUni >> 6 & 0x3f);
                        $sReturn .= chr(0x80 | $lUni & 0x3f);
                        }
                elsif ($lUni < 0x200000)
                        {
                        $sReturn .= chr(0xf0 | $lUni >> 18);
                        $sReturn .= chr(0x80 | $lUni >> 12 & 0x3f);
                        $sReturn .= chr(0x80 | $lUni >> 6 & 0x3f);
                        $sReturn .= chr(0x80 | $lUni & 0x3f);
                        }
                }
        return $sReturn;
        }

# false BigEndian parameter, since I'm on Win2000
my $utf8_text = UnicodeToUtf8(0, $uni_text);
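
As a worked example of the two-byte branch in UnicodeToUtf8, here is the arithmetic for a single code point. The value 0xE6 is an assumption: it is the byte the accented letter would have in a single-byte Windows code page, and it is consistent with the output shown in this post.

```perl
use strict;
use warnings;

my $cp = 0xE6;                      # assumed code point of the accented letter

# Two-byte UTF-8 form: 110xxxxx 10xxxxxx
my $lead  = 0xC0 | ($cp >> 6);      # 0xC0 | 0x03 = 0xC3
my $trail = 0x80 | ($cp & 0x3F);    # 0x80 | 0x26 = 0xA6

printf "U+%04X -> %02X %02X\n", $cp, $lead, $trail;
# prints "U+00E6 -> C3 A6"
```

So a single character in the 0x80 to 0x7FF range always becomes two bytes in the UTF-8 string.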

Now we finally get to the heart of the problem.

print $utf8_text; # produces abcdefghiĂ¦

That is, two characters for the ć character in the string.  This is due, 
I'm assuming, to weirdness with the utf8 pragma.  The problem is this:

print length($utf8_text) . "\n"; # 11 !!!!!

Does anyone have any experience with this?  I've checked the utf8 manpage, 
but it glosses over length(), including it in a list of functions that 
should continue to operate on characters, not bytes.  In fact, this is the 
problem: utf8 seems to consider ć two characters.
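
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes a Perl that ships the core Encode module (5.8 or later, which is newer than this post), and it assumes the accented letter ended up as the two bytes C3 A6, as the printed output suggests. It is not a diagnosis of the script above, only an illustration of how the two counts can differ:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed result of the hand-rolled conversion: nine ASCII bytes plus C3 A6.
my $octets = "abcdefghi\xC3\xA6";

# Tell Perl these bytes are UTF-8, yielding a string of characters.
my $chars = decode("UTF-8", $octets);

print length($octets), "\n";    # 11, counts bytes
print length($chars),  "\n";    # 10, counts characters
```

A string assembled byte by byte with chr() is still just bytes as far as Perl is concerned, so length() has no way of knowing that two of them belong together.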

Thanks in advance
Aaron Craig
Programming
iSoftitler.com