Hi,

It's interesting, and it may be the problem, but I think the CGI.pm
way is not a good solution for decoding a URL-encoded string: if you
say chr(0xE2)~chr(0x82)~chr(0xA2), then they are 3 characters, and

s:g/A2/AC/?

Yes, never mind that.

First of all, I should say that I'm no expert on encodings; I just have some experience and I try to think logically.

I think we've discovered a bug in Pugs, but as I don't know that much
about UTF-8, I'd like to see the following confirmed first :).
  # This is what *should* happen:
  my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
  say $x.bytes;  # 3
  say $x.chars;  # 1

I don't agree.

  # This is what currently happens:
  my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
  say $x.bytes;  # 6
  say $x.chars;  # 3

I think this is the correct behaviour.

chr(0xE2) = chr(226) is a valid character in Unicode; it is â, I think. When I write chr(...), it has to mean that I'm talking about a *character*, not a byte. If I'm talking about character #226, then its internal representation will be 0x00E2 (I don't know why not 0x000000E2, but that's not so important). If that is what I mean, and I concatenate three characters (not three bytes), then the result is three characters and six bytes.

Comparison with perl5:
  $ perl -MEncode -we '
    my $x = decode "utf-8", chr(0xE2).chr(0x82).chr(0xAC);
    print length $x;
  '
  1 # (chars)

  $ perl -we '
    my $x = chr(0xE2).chr(0x82).chr(0xAC);
    print length $x;
  '
  3 # (bytes)

Your example shows the same thing I'm talking about: if you just concatenate characters, their length is 3 *characters* and *not bytes*, as you write in the second perl5 example. If you then decode those three characters, they can be converted to one character.
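
To make the distinction concrete, here is a minimal perl5 sketch, assuming the core Encode module. The variable names are only for illustration, and this shows perl5 behaviour, not what Pugs does internally:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three separate characters: U+00E2, U+0082 and U+00AC.
my $chars   = chr(0xE2) . chr(0x82) . chr(0xAC);
my $n_chars = length $chars;             # number of characters

# Encoding those three characters as UTF-8 needs two bytes for each.
my $utf8    = encode("UTF-8", $chars);
my $n_bytes = length $utf8;              # number of bytes

# Treating the same three values as UTF-8 *bytes* and decoding them
# gives one character, U+20AC (the euro sign).
my $decoded   = decode("UTF-8", $chars);
my $n_decoded = length $decoded;

printf "chars=%d utf8_bytes=%d decoded_chars=%d U+%04X\n",
    $n_chars, $n_bytes, $n_decoded, ord $decoded;
```

So the concatenation alone gives three characters (six bytes once encoded as UTF-8), and only the explicit decode step turns them into one character.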


Bye,
  Andras
