Hi,
That's interesting, and it may be the problem, but I think the CGI.pm way is not a good way to decode a URL-encoded string: if you write chr(0xE2)~chr(0x82)~chr(0xA2), then those are 3 characters, and
s:g/A2/AC/?
Yes, ignore that, it was a typo.
First, I should say that I'm no expert on encodings; I just have some experience and I try to think logically.
I think we've discovered a bug in Pugs, but as I don't know that much about UTF-8, I'd like to see the following confirmed first :).

  # This is what *should* happen:
  my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
  say $x.bytes;   # 3
  say $x.chars;   # 1
I don't agree.
  # This is what currently happens:
  my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
  say $x.bytes;   # 6
  say $x.chars;   # 3
I think this is the correct behavior.
chr(0xE2) = chr(226) is a valid character in Unicode; I believe it is â. When I write chr(...), it has to mean that I'm talking about a *character*, not a byte. If I'm talking about character #226, then its internal representation will be 0x00E2 (I don't know why not 0x000000E2, but that's not so important). If that is what I mean, and I concatenate three characters (not three bytes), then the result is three characters and six bytes.
Comparison with perl5:

  $ perl -MEncode -we '
      my $x = decode "utf-8", chr(0xE2).chr(0x82).chr(0xAC);
      print length $x;
    '
  1  # (chars)

  $ perl -we '
      my $x = chr(0xE2).chr(0x82).chr(0xAC);
      print length $x;
    '
  3  # (bytes)
Your example shows the same thing I'm talking about: if you just concatenate characters, then their length will be 3 *characters*, *not bytes* as you wrote in the second perl5 example. If you decode these three characters, then they can be converted to one character.
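To make the distinction concrete, here is a rough perl5 sketch of what I mean. It assumes the standard Encode module; the variable names ($chars, $utf8, $decoded) are only for illustration, and this shows perl5 behaviour, not what Pugs does today.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three separate characters: U+00E2, U+0082, U+00AC.
my $chars = chr(0xE2) . chr(0x82) . chr(0xAC);
print length($chars), "\n";          # 3 (characters)

# Encoding those three characters as UTF-8 takes two bytes each.
my $utf8 = encode("UTF-8", $chars);
print length($utf8), "\n";           # 6 (bytes)

# Only when the same three values are treated as raw bytes and
# decoded as UTF-8 do they collapse into one character, U+20AC.
my $decoded = decode("UTF-8", $chars);
print length($decoded), "\n";        # 1 (character)
printf "U+%04X\n", ord($decoded);    # U+20AC
```

So the "3 chars, 6 bytes" result is what I would expect from plain concatenation, and the "1 char" result only appears after an explicit decoding step.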
Bye, Andras