On Thu, Dec 16, 2010 at 20:24, David E. Wheeler <da...@kineticode.com> wrote:
> On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
>
>> You might argue this is a bug with URI::Escape as I *think* all uri's
>> will be utf8 encoded.  Anyway, I think postgres is doing the right
>> thing here.
>
> No, URI::Escape is fine. The issue is that if you don't decode text to Perl's
> internal form, it assumes that it's Latin-1.

So... you are saying "\xc3\xa9" eq "\xe9" or chr(233)? I'm saying they
are not, and if you want \xc3\xa9 to be treated as chr(233) you need to
tell perl what encoding the string is in (err, well, actually decode it
so it's in "perl space" as unicode characters correctly).

> Maybe I'm misunderstanding, but it seems to me that:
>
> * String arguments passed to PL/Perl functions should be decoded from the
>   server encoding to Perl's internal representation before the function
>   actually gets them.

Currently postgres has 2 behaviors:
1) If the database is utf8, turn on the utf8 flag. According to the
perldoc snippet I quoted, this should mean it's a sequence of utf8 bytes
and perl should interpret it as such.
2) It's not utf8, so we just leave it as octets.

So in "perl space" length($_[0]) returns the number of characters when
you pass in a multibyte char, *not* the number of bytes. Which is
correct, so um, check, we do that. Right?

In the URI::Escape example we have:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use URI::Escape;
warn(length($_[0]));
return uri_unescape($_[0]);
$$ LANGUAGE plperlu;

# select url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING:  38 at line 2

Ok, that length looks right. Just for grins, let's try adding one
multibyte char:

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon☺');
WARNING:  39 at line 2.
CONTEXT:  PL/Perl function "url_decode"
          url_decode
-------------------------------
 comment passer le réveillon☺
(1 row)

Still right. Now let's try the utf8::decode version that "works". Only
this time let's look at the length of the string we are returning
instead of the one we are passing in:

# CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
use URI::Escape;
utf8::decode($_[0]);
my $str = uri_unescape($_[0]);
warn(length($str));
return $str;
$$ LANGUAGE plperlu;

# SELECT url_decode('comment%20passer%20le%20r%C3%A9veillon');
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
         url_decode
-----------------------------
 comment passer le réveillon
(1 row)

Looks harmless enough...

# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
 length
--------
     27
(1 row)

Wait a minute... those lengths should match.

Post patch they do:

# SELECT length(url_decode('comment%20passer%20le%20r%C3%A9veillon'));
WARNING:  28 at line 5.
CONTEXT:  PL/Perl function "url_decode"
 length
--------
     28
(1 row)

Still confused? Yeah, me too. Maybe this will help:

#!/usr/bin/perl
use URI::Escape;
my $str = uri_unescape("%c3%a9");
die "first match" if($str =~ m/\xe9/);
utf8::decode($str);
die "2nd match" if($str =~ m/\xe9/);

gives:

$ perl t.pl
2nd match at t.pl line 6.

See? Either uri_unescape() should be decoding that utf8, or you need to
do it *after* you call uri_unescape(). Hence the "maybe it could be
considered a bug in uri_unescape()".

> * Values returned from PL/Perl functions that are in Perl's internal
>   representation should be encoded into the server encoding before they're
>   returned.
>
> I didn't really follow all of the above; are you aiming for the same thing?

Yeah, the patch addresses this part. Right now we just spit out
whatever the internal format happens to be.

Anyway, it's all probably clear as mud; this part of perl is one of the
hardest IMO.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
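
P.S. To make the ordering point concrete, here is a minimal sketch that
uses only core Perl. percent_unescape() is a hypothetical stand-in for
uri_unescape() (assumed here to do nothing more than map each %XX to one
octet), and url_decode_utf8() is an illustrative name, not something
from URI::Escape or the patch. It also assumes the escaped octets really
are UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for URI::Escape::uri_unescape(): replace each
# %XX escape with the single octet it names. The result is a byte string.
sub percent_unescape {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Illustrative helper: unescape first, *then* decode the octets as UTF-8
# so the result is a string of characters in "perl space".
sub url_decode_utf8 {
    my ($escaped) = @_;
    return decode('UTF-8', percent_unescape($escaped));
}

my $bytes = percent_unescape('r%C3%A9veillon');   # still octets
my $chars = url_decode_utf8('r%C3%A9veillon');    # decoded characters

printf "octets: %d, characters: %d\n", length($bytes), length($chars);
print "decoded string matches chr(233)\n" if $chars =~ /\xe9/;
print "undecoded string does not\n"       if $bytes !~ /\xe9/;
```

Decoding before unescaping, as in the utf8::decode version above, has
nothing to work on: at that point the string is plain ASCII and the
\xc3\xa9 octets only appear once the %XX escapes are expanded.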