Re: [HACKERS] plperlu problem with utf8

Alex Hunsaker Thu, 16 Dec 2010 20:40:23 -0800

On Thu, Dec 16, 2010 at 20:24, David E. Wheeler <[email protected]> wrote:
> On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
>
>> You might argue this is a bug with URI::Escape as I *think* all uri's
>> will be utf8 encoded.  Anyway, I think postgres is doing the right
>> thing here.
>
> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's 
> internal form, it assumes that it's Latin-1.


So... you are saying "\xc3\xa9" eq "\xe9" or chr(233) ?

I'm saying they are not, and if you want \xc3\xa9 to be treated as
chr(233) you need to tell perl what encoding the string is in (err,
well, actually decode it so it's in "perl space" as Unicode
characters).
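
A minimal sketch of that comparison, in plain Perl outside of postgres.
The variable names are made up for illustration; utf8::decode() is the
core function used later in this thread:

```perl
use strict;
use warnings;

my $octets = "\xc3\xa9";   # two octets: the UTF-8 encoding of e-acute
my $char   = chr(233);     # one character: U+00E9

# Undecoded, perl sees two separate characters (0xC3 and 0xA9).
my $equal_before = ($octets eq $char) ? 1 : 0;

# Decode a copy in place so it becomes the single character U+00E9.
my $decoded = $octets;
utf8::decode($decoded);
my $equal_after = ($decoded eq $char) ? 1 : 0;

print "before decode: $equal_before, after decode: $equal_after\n";
# prints "before decode: 0, after decode: 1"
```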

> Maybe I'm misunderstanding, but it seems to me that:
>
> * String arguments passed to PL/Perl functions should be decoded from the 
> server encoding to Perl's internal representation before the function 
> actually gets them.

Currently postgres has 2 behaviors:
1) If the database is utf8, turn on the utf8 flag. According to the
perldoc snippet I quoted, this should mean it's a sequence of utf8 bytes
and perl should interpret it as such.
2) It's not utf8, so we just leave it as octets.

So in "perl space" length($_[0]) returns the number of characters when
you pass in a multibyte char, *not* the number of bytes.  Which is
correct, so, um, check: we do that.  Right?
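
To make the characters-vs-bytes distinction concrete, here is a rough
sketch in plain Perl.  It assumes decode() from the core Encode module
as a stand-in for what turning on the utf8 flag achieves for valid
UTF-8 input; it is not what plperl itself calls:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for "r", e-acute, "veillon" and U+263A (the smiley).
my $octets = "r\xc3\xa9veillon\xe2\x98\xba";

# As octets, length() counts bytes.
my $byte_len = length($octets);

# Once decoded into perl space, length() counts characters.
my $chars    = decode('UTF-8', $octets);
my $char_len = length($chars);

print "octets: $byte_len, characters: $char_len\n";
# prints "octets: 13, characters: 10"
```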

In the URI::Escape example we have:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
   use URI::Escape;
   warn(length($_[0]));
   return uri_unescape($_[0]);
$$ LANGUAGE plperlu;

# select url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING: 38 at line 2

OK, that length looks right.  Just for grins, let's try adding one
multibyte char:

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
WARNING:  39 at line 2.
CONTEXT:  PL/Perl function "url_decode"
          url_decode
-------------------------------
 comment passer le rÃ©veillon☺
(1 row)

Still right.  Now let's try the utf8::decode version that "works".  Only
let's look at the length of the string we are returning instead of the
one we are passing in:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar  AS $$
   use URI::Escape;
   utf8::decode($_[0]);
   my $str = uri_unescape($_[0]);
   warn(length($str));
   return $str;
$$ LANGUAGE plperlu;

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
         url_decode
-----------------------------
 comment passer le réveillon
(1 row)

Looks harmless enough...

# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
 length
--------
     27
(1 row)

Wait a minute... those lengths should match.

Post patch they do:
# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
 length
--------
     28
(1 row)
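
One way to see where the off-by-one comes from, as a sketch only.  It
assumes the returned string holds the two characters \xc3 and \xa9
stored as single bytes (no utf8 flag), and it uses Encode's encode()
and decode() to mimic a UTF-8 server reading the result; none of this
is the actual plperl code:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# What perl holds: "r", 0xC3, 0xA9, "veillon" as ten characters.
my $str      = "r\xc3\xa9veillon";
my $perl_len = length($str);

# Unpatched: the raw bytes are handed over and read as UTF-8, so the
# pair 0xC3 0xA9 collapses into a single e-acute.
my $unpatched_len = length(decode('UTF-8', $str));

# Patched: encode perl's characters to UTF-8 first, so both survive.
my $wire        = encode('UTF-8', $str);
my $patched_len = length(decode('UTF-8', $wire));

print "perl: $perl_len, raw bytes: $unpatched_len, encoded first: $patched_len\n";
# prints "perl: 10, raw bytes: 9, encoded first: 10"
```

Scaled up to the full string, that is the 28 vs 27 seen above.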

Still confused? Yeah me too.  Maybe this will help:

#!/usr/bin/perl
use URI::Escape;
my $str = uri_unescape("%c3%a9");
die "first match" if($str =~ m/\xe9/);
utf8::decode($str);
die "2nd match" if($str =~ m/\xe9/);

gives:
$ perl t.pl
2nd match at t.pl line 6.

See? Either uri_unescape() should be decoding that utf8, or you need
to do it *after* you call uri_unescape().  Hence the idea that it could
be considered a bug in uri_unescape().
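
A sketch of the "decode after unescaping" order.  To keep it
self-contained, percent_decode() below is a hypothetical stand-in for
uri_unescape(), and url_decode_utf8() is a made-up name; only
utf8::decode() is the real core function:

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape's uri_unescape(): turn each
# %XX escape into the single character with that code.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Unescape first, then decode the resulting octets as UTF-8.
sub url_decode_utf8 {
    my ($s) = @_;
    my $out = percent_decode($s);
    utf8::decode($out);
    return $out;
}

my $result = url_decode_utf8('comment%20passer%20le%20r%C3%A9veillon');
print length($result), "\n";   # prints 27
```

Input that already contains decoded non-ASCII characters alongside
percent escapes needs more care than this sketch gives it.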

> * Values returned from PL/Perl functions that are in Perl's internal 
> representation should be encoded into the server encoding before they're 
> returned.
> I didn't really follow all of the above; are you aiming for the same thing?

Yeah, the patch addresses this part.  Right now we just spit out
whatever the internal format happens to be.

Anyway, it's all probably clear as mud; this part of perl is one of the
hardest IMO.
