On Wed, Dec 8, 2010 at 14:15, Oleg Bartunov <o...@sai.msu.su> wrote:
> On Wed, 8 Dec 2010, David E. Wheeler wrote:
>
>> On Dec 8, 2010, at 8:13 AM, Oleg Bartunov wrote:
>>
>>> adding utf8::decode($_[0]) solves the problem:
>> Hrm. Ideally all strings passed to PL/Perl functions would be decoded.
>
> yes, this is what I expected.

Erm... no. The in and out from perl AFAICT works fine (minus a caveat I
found, see the end of the mail). The problem here is that you have the
URL-encoded utf8 bytes "%C3%A9". URI::Escape basically does
chr(hex("c3")) and chr(hex("a9")). Perl, generally, will treat that as
two separate unicode code points. So you end up with two characters (one
with a code point of 0xc3, the other with 0xa9) instead of the one
character you expect. If you want \xc3\xa9 to be treated as a utf8 byte
sequence, you need to tell perl those bytes are utf8 by decoding them.
Heck, for all we know, instead of being a utf8 sequence it could have
been a utf16 sequence.

You might argue this is a bug in URI::Escape, as I *think* all URIs will
be utf8 encoded. Anyway, I think postgres is doing the right thing here.

In playing around I did find what I think is a postgres bug. Perl has 2
ways it can store things internally. Per perldoc perlunicode:

    Using Unicode in XS
    ...
    What the "UTF8" flag means is that the sequence of octets in the
    representation of the scalar is the sequence of UTF-8 encoded code
    points of the characters of a string. The "UTF8" flag being off
    means that each octet in this representation encodes a single
    character with code point 0..255 within the string.

Postgres always prints whatever the internal representation happens to
be, ignoring the UTF8 flag and the server encoding.

# create or replace function chr(i int, i2 int) returns text as $$
return chr($_[0]).chr($_[1]);
$$ language plperlu;
CREATE FUNCTION

# show server_encoding;
 server_encoding
-----------------
 SQL_ASCII

# SELECT length(chr(128, 33));
 length
--------
      2

# SELECT length(chr(128, 333));
 length
--------
      4

Grr, that should error out with "Invalid server encoding", or worst case
should return a length of 3 (it utf8 encoded 128 into 2 bytes instead of
leaving it as 1). In this case the 333 causes perl to store the string
internally as utf8.

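For anyone who wants to check the arithmetic outside of perl, here is a
minimal standalone C sketch of the two readings of the same octets. The
function names are made up for illustration (they are not Perl or
PostgreSQL APIs), and count_chars_as_utf8() assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative helpers only; none of these names exist in Perl or
 * PostgreSQL.
 */

/* Bytes UTF-8 needs for one code point (assumes cp <= 0x10FFFF). */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    return 4;
}

/* "UTF8 flag off" reading: every octet is one character (0..255). */
size_t count_chars_as_bytes(const unsigned char *s, size_t len)
{
    (void) s;               /* the content does not matter, only the length */
    return len;
}

/*
 * "UTF8 flag on" reading: octets are UTF-8, so count only the bytes that
 * start a sequence (everything except 10xxxxxx continuation bytes).
 * Assumes the input is well-formed UTF-8.
 */
size_t count_chars_as_utf8(const unsigned char *s, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```

With these, the octets C3 A9 count as 2 characters when read one octet
per character and as 1 character when read as utf8, and
utf8_encoded_len(128) + utf8_encoded_len(333) is 2 + 2 = 4, which is the
length reported above.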
Now on a utf8 database:

# show server_encoding;
 server_encoding
-----------------
 UTF8

# SELECT length(chr(128, 33));
ERROR: invalid byte sequence for encoding "UTF8": 0x80
CONTEXT: PL/Perl function "chr"

# SELECT length(chr(128, 333));
CONTEXT: PL/Perl function "chr"
 length
--------
      2

Same thing here, we just end up using the internal format. In one case
it works, in the other it does not. The main point being, most of the
time it *happens* to work. But it's really just by chance.

I think what we should do is use SvPVutf8() instead of SvPV() in
sv2text_mbverified() when the database encoding is UTF8. SvPV gives us a
pointer to a string in perl's current internal format (maybe unicode,
maybe a utf8 byte sequence). SvPVutf8, by contrast, will always give us
a utf8-encoded string (which may or may not be valid!).

Something like the attached. Thoughts?

I'm not very happy with the non-utf8 case. The elog(ERROR, "invalid byte
sequence") is a total cop-out, yes, but I did not see a good solution
short of hand rolling our own version of sv_utf8_downgrade(). Is it
worth it?

diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 5595baa..8a9d677 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -254,7 +254,31 @@ sv2text_mbverified(SV *sv)
 	 * length, whatever uses the "verified" value might get something quite
 	 * weird.
 	 */
-	val = SvPV(sv, len);
+
+	/*
+	 * When we are in an UTF8 encoding we want to make sure we get back a utf8
+	 * byte sequence instead of whatever perls internal format happens to be.
+	 *
+	 * Non UTF8 will just treat everything as bytes/latin1 that is
+	 * SvPVutf8(chr(170)) len == 2
+	 * SvPVbyte(chr(170)) len == 1
+	 * SvPV(chr(170))) len == 1 || 2
+	 */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		val = SvPVutf8(sv, len);
+	else
+	{
+		/*
+		 * See if we can safely represent our string as bytes if not bail out
+		 * otherwise perl dies with "Wide Character" and takes the backend down
+		 * with it
+		 */
+		if (sv_utf8_downgrade(sv, true))
+			val = SvPVbyte(sv, len);
+		else
+			elog(ERROR, "invalid byte sequence");
+	}
+
 	pg_verifymbstr(val, len, false);
 	return val;
 }

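To show what the non-utf8 branch is relying on, here is a rough
standalone model of a "can this be downgraded to bytes?" check.
downgrade_utf8() is a hypothetical helper written for this mail, not
perl's sv_utf8_downgrade() and not a proposal for the hand-rolled
version; it only shows that the downgrade succeeds exactly when every
character has a code point of 255 or less.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a Perl or PostgreSQL function): try to rewrite
 * a UTF-8 string as one byte per character.
 *
 * On success, writes the bytes to out, stores the count in *out_len and
 * returns true. Returns false if any character is above U+00FF or the
 * input is malformed. out must have room for at least len bytes.
 */
bool downgrade_utf8(const unsigned char *in, size_t len,
                    unsigned char *out, size_t *out_len)
{
    size_t n = 0;
    size_t i = 0;

    while (i < len)
    {
        unsigned char b = in[i];

        if (b < 0x80)
        {
            /* plain ASCII, copied through unchanged */
            out[n++] = b;
            i += 1;
        }
        else if ((b == 0xC2 || b == 0xC3) && i + 1 < len &&
                 (in[i + 1] & 0xC0) == 0x80)
        {
            /* two-byte sequence covering U+0080..U+00FF */
            out[n++] = (unsigned char) (((b & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        }
        else
        {
            /* a "wide character" (above 255) or malformed input */
            return false;
        }
    }

    *out_len = n;
    return true;
}
```

In this model chr(128).chr(33) (octets C2 80 21 in utf8) downgrades to
the 2 bytes 80 21, while chr(128).chr(333) (octets C2 80 C5 8D) is
rejected, which is the case where the patch raises the error.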
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers