Using the utf8 pragma:
use utf8;
If I have some text:
my $text = "abcdefghi�"; # note the �
and I convert it to a fake Unicode string (fake, because you couldn't do
this with real Unicode charsets like Chinese) by sticking nulls in between
letters:
sub AnsiToUnicode($)
{
    my ($sAnsi) = @_;
    my $lLength = length($sAnsi);
    my $sUnicode = "";
    # Follow every input byte with a null byte (little-endian 16-bit units).
    for (my $i = 0; $i < $lLength; $i++)
    {
        $sUnicode .= substr($sAnsi, $i, 1);
        $sUnicode .= "\0";
    }
    return $sUnicode;
}
my $uni_text = AnsiToUnicode($text);
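For comparison, the same null-interleaving can be sketched with the Encode module (bundled with Perl since 5.8). This assumes the input bytes are Latin-1 (ISO-8859-1), which is not stated above, and the helper name ansi_to_utf16le is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: treat the input bytes as Latin-1 (an assumption)
# and re-encode them as little-endian UTF-16 without a byte-order mark.
sub ansi_to_utf16le {
    my ($ansi_bytes) = @_;
    my $chars = decode('ISO-8859-1', $ansi_bytes);
    return encode('UTF-16LE', $chars);
}

my $sample = ansi_to_utf16le("abc\xE6");
print length($sample), "\n";    # 8 bytes: two per input character
```

For Latin-1 input this should give the same bytes as AnsiToUnicode, since every code point below 0x100 becomes the original byte followed by a null.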
Then I convert it to UTF-8. (I know: why not go straight from ANSI to
UTF-8? It's a long story. Basically, we work in Unicode here, so I've
never had to go from ANSI to UTF-8, but for the sake of the question,
let's do it this way.)
sub UnicodeToUtf8($$)
{
    my ($bIsBigEndian, $sText) = @_;
    my $sReturn = "";
    my $lLength = length($sText);
    # Walk the input two bytes (one 16-bit code unit) at a time.
    for (my $i = 0; $i < $lLength; $i += 2)
    {
        my $sChar = substr($sText, $i, 2);
        my $lByte1;    # high-order byte
        my $lByte2;    # low-order byte
        if ($bIsBigEndian == 0)
        {
            $lByte1 = ord(substr($sChar, 1, 1));
            $lByte2 = ord(substr($sChar, 0, 1));
        }
        else
        {
            $lByte1 = ord(substr($sChar, 0, 1));
            $lByte2 = ord(substr($sChar, 1, 1));
        }
        my $lUni = ($lByte1 * 0x100) + $lByte2;
        if ($lUni < 0x80)
        {
            # One byte: 0xxxxxxx
            $sReturn .= chr($lUni);
        }
        elsif ($lUni < 0x800)
        {
            # Two bytes: 110xxxxx 10xxxxxx
            $sReturn .= chr(0xc0 | $lUni >> 6);
            $sReturn .= chr(0x80 | $lUni & 0x3f);
        }
        elsif ($lUni < 0x10000)
        {
            # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            $sReturn .= chr(0xe0 | $lUni >> 12);
            $sReturn .= chr(0x80 | $lUni >> 6 & 0x3f);
            $sReturn .= chr(0x80 | $lUni & 0x3f);
        }
        elsif ($lUni < 0x200000)
        {
            # Four bytes; never reached here, since $lUni is at most 0xFFFF.
            $sReturn .= chr(0xf0 | $lUni >> 18);
            $sReturn .= chr(0x80 | $lUni >> 12 & 0x3f);
            $sReturn .= chr(0x80 | $lUni >> 6 & 0x3f);
            $sReturn .= chr(0x80 | $lUni & 0x3f);
        }
    }
    return $sReturn;
}
my $utf8_text = UnicodeToUtf8(0, $uni_text); # false BigEndian parameter since I'm on Win2000
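As a cross-check on the hand-rolled conversion, here is a minimal sketch of the same step done with the Encode module. It assumes Encode is available, and the helper name utf16_to_utf8 is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode UTF-16 bytes of the given endianness into
# characters, then encode those characters as UTF-8 bytes.
sub utf16_to_utf8 {
    my ($is_big_endian, $utf16_bytes) = @_;
    my $encoding = $is_big_endian ? 'UTF-16BE' : 'UTF-16LE';
    return encode('UTF-8', decode($encoding, $utf16_bytes));
}

# "a" followed by U+00E6, as little-endian 16-bit units.
my $utf8 = utf16_to_utf8(0, "a\0\xE6\0");
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";    # 61 C3 A6
```

Any code point from 0x80 to 0x7FF comes out as two bytes here, just as it does in the elsif ($lUni < 0x800) branch of UnicodeToUtf8.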
Now we finally get to the heart of the problem.
print $utf8_text; # produces abcdefghiæ
That is, the � character in the string comes out as two characters. I
assume this is due to some weirdness with the utf8 pragma. The problem is this:
print length($utf8_text) . "\n"; # 11 !!!!!
Does anyone have any experience with this? I've checked the utf8 manpage,
but it glosses over length(), including it in a list of functions that
should continue to operate on characters, not bytes. In fact, this is
exactly the problem: utf8 seems to consider � two characters.
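One way to check whether length() is counting bytes or characters is sketched below. It assumes the Encode module is available and uses U+00E6 as a stand-in for the accented character in my string, which is a guess:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "abcdefghi" followed by U+00E6 (an assumed stand-in
# for the accented character in the original string).
my $utf8_bytes = "abcdefghi\xC3\xA6";

# Undecoded, every byte is one element of the string.
print length($utf8_bytes), "\n";    # 11

# Decoded, the two-byte sequence becomes a single character.
my $chars = decode('UTF-8', $utf8_bytes);
print length($chars), "\n";         # 10
```

If this is what is happening in my script, the string built up with chr() is just 11 separate bytes as far as Perl is concerned, and nothing has told it that the last two belong together.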
Thanks in advance
Aaron Craig
Programming
iSoftitler.com