Re: [HACKERS] plperlu problem with utf8

Alex Hunsaker Thu, 16 Dec 2010 18:40:42 -0800

On Wed, Dec 8, 2010 at 14:15, Oleg Bartunov <[email protected]> wrote:
> On Wed, 8 Dec 2010, David E. Wheeler wrote:
>
>> On Dec 8, 2010, at 8:13 AM, Oleg Bartunov wrote:
>>
>>> adding utf8::decode($_[0]) solves the problem:
>> Hrm. Ideally all strings passed to PL/Perl functions would be decoded.
>
> yes, this is what I expected.


Erm... no.  The in and out from perl AFAICT works fine (minus a caveat
I found, see the end of the mail).

The problem here is that you have the URL-encoded utf8 bytes "%C3%A9".
URI::Escape basically does chr(hex("c3")) and chr(hex("a9")).  Perl
will, in general, treat those as two separate unicode code points.  So
you end up with two characters (one with a code point of 0xc3, the
other with 0xa9) instead of the one character you expect.  If you want
\xc3\xa9 to be treated as a utf8 byte sequence, you need to tell perl
those bytes are utf8 by decoding them.  Heck, for all we know, instead
of being a utf8 sequence it could have been a utf16 sequence.
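
To make the two readings concrete, here is a minimal C sketch (not what
URI::Escape or perl actually run; percent_decode and utf8_char_count
are hypothetical helper names, and the input is assumed to be valid
UTF-8).  It percent-decodes a string to raw bytes and then counts
characters the way a UTF-8 decoder would:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: percent-decode src into out.
 * out must have room for strlen(src) + 1 bytes.
 * Returns the number of raw bytes written (excluding the trailing NUL). */
size_t percent_decode(const char *src, unsigned char *out)
{
    size_t n = 0;

    while (*src) {
        if (src[0] == '%' && isxdigit((unsigned char) src[1])
                          && isxdigit((unsigned char) src[2])) {
            char hex[3] = { src[1], src[2], '\0' };

            /* Same idea as chr(hex("c3")): one escape becomes one byte. */
            out[n++] = (unsigned char) strtol(hex, NULL, 16);
            src += 3;
        } else {
            out[n++] = (unsigned char) *src++;
        }
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical helper: count characters when the bytes are read as UTF-8.
 * Every byte that is not a continuation byte (10xxxxxx) starts a new
 * character.  Assumes the buffer holds valid UTF-8. */
size_t utf8_char_count(const unsigned char *buf, size_t len)
{
    size_t chars = 0;

    for (size_t i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```

For "%C3%A9", percent_decode yields the two bytes 0xC3 0xA9.  Read one
byte per character that is two characters; read as UTF-8 (which is what
utf8::decode tells perl to do) it is one.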

You might argue this is a bug in URI::Escape, as I *think* all URIs
will be utf8 encoded.  Anyway, I think postgres is doing the right
thing here.

In playing around I did find what I think is a postgres bug.  Perl has
two ways it can store strings internally.  Per perldoc perlunicode:

Using Unicode in XS
... What the "UTF8" flag means is that the sequence of octets in the
representation of the scalar is the sequence of UTF-8 encoded code
points of the characters of a string.  The "UTF8" flag being off means
that each octet in this representation encodes a single character with
code point 0..255 within the string.

Postgres always prints whatever the internal representation happens to
be, ignoring both the UTF8 flag and the server encoding.

# create or replace function chr(i int, i2 int) returns text as $$
return chr($_[0]).chr($_[1]); $$ language plperlu;
CREATE FUNCTION

# show server_encoding;
 server_encoding
-----------------
 SQL_ASCII

# SELECT length(chr(128, 33));
 length
--------
      2

# SELECT length(chr(128, 333));
 length
--------
      4

Grr, that should error out with "Invalid server encoding", or worst
case should return a length of 3 (it utf8 encoded 128 into 2 bytes
instead of leaving it as 1).  In this case the 333 causes perl to
store the whole string internally as utf8.
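
The byte counts above follow from the standard UTF-8 length rules.  A
minimal C sketch of the arithmetic (utf8_len and utf8_total are
hypothetical names for illustration, not anything in perl or
postgres):

```c
#include <stddef.h>

/* Bytes UTF-8 needs for one code point (valid for cp <= 0x10FFFF). */
size_t utf8_len(unsigned long cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    return 4;
}

/* Total bytes when every code point in the string is stored as UTF-8,
 * which is what happens once the scalar has the UTF8 flag on. */
size_t utf8_total(const unsigned long *cps, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += utf8_len(cps[i]);
    return total;
}
```

chr(128).chr(33) has no character above 255, so perl keeps one byte per
character and the length is 2.  chr(128).chr(333) forces the UTF8 flag
on, so 128 and 333 each take 2 bytes, hence 4.  The length of 3 would
come from 128 staying a single byte next to the 2 bytes for 333.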

Now on a utf8 database:

# show server_encoding;
 server_encoding
-----------------
 UTF8

# SELECT length(chr(128, 33));
ERROR:  invalid byte sequence for encoding "UTF8": 0x80
CONTEXT:  PL/Perl function "chr"

# SELECT length(chr(128, 333));
CONTEXT:  PL/Perl function "chr"
 length
--------
      2

Same thing here: we just end up using the internal format.  In one
case it works, in the other it does not.  The main point being, most
of the time it *happens* to work.  But it's really just by chance.

I think what we should do is use SvPVutf8() instead of SvPV() in
sv2text_mbverified() when the database encoding is UTF8.  SvPV gives
us a pointer to the string in whatever perl's current internal format
is (maybe one byte per character, maybe a utf8 byte sequence), while
SvPVutf8 will always give us a utf8 encoded string (which may or may
not be valid!).

Something like the attached.  Thoughts?  I'm not very happy with the
non-utf8 case.  The elog(ERROR, "invalid byte sequence") is a total
cop-out, yes.  But I did not see a good solution short of hand rolling
our own version of sv_utf8_downgrade().  Is it worth it?
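
For a sense of what a hand-rolled downgrade would involve, here is a
rough C sketch on a plain byte buffer.  It is not perl's
implementation and does not touch an SV; utf8_to_bytes is a
hypothetical name.  It succeeds only when every character fits in one
byte (code points 0..255), which in UTF-8 means ASCII or a two-byte
sequence led by 0xC2 or 0xC3:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: convert a UTF-8 buffer to one byte per character.
 * out must have room for in_len bytes.  Returns false on a code point
 * above 0xFF or on malformed input; out and *out_len are then
 * unspecified. */
bool utf8_to_bytes(const unsigned char *in, size_t in_len,
                   unsigned char *out, size_t *out_len)
{
    size_t n = 0;

    for (size_t i = 0; i < in_len; i++) {
        unsigned char b = in[i];

        if (b < 0x80) {
            out[n++] = b;               /* plain ASCII */
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < in_len
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            out[n++] = (unsigned char) (((b & 0x03) << 6)
                                        | (in[i + 1] & 0x3F));
            i++;
        } else {
            return false;               /* wide character or bad UTF-8 */
        }
    }
    *out_len = n;
    return true;
}
```

With this, the utf8 form of chr(128).chr(33) (bytes C2 80 21) comes
back as the two bytes 80 21, while anything containing chr(333) is
rejected, which is the case where the patch below raises an error.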

diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 5595baa..8a9d677 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -254,7 +254,31 @@ sv2text_mbverified(SV *sv)
 	 * length, whatever uses the "verified" value might get something quite
 	 * weird.
 	 */
-	val = SvPV(sv, len);
+
+	/*
+	 * When we are in a UTF8 encoding we want to make sure we get back a utf8
+	 * byte sequence instead of whatever perl's internal format happens to be.
+	 *
+	 * Non UTF8 will just treat everything as bytes/latin1, that is:
+	 * SvPVutf8(chr(170)) len == 2
+	 * SvPVbyte(chr(170)) len == 1
+	 * SvPV(chr(170)) len == 1 || 2
+	 */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		val = SvPVutf8(sv, len);
+	else
+	{
+		/*
+		 * See if we can safely represent our string as bytes.  If not, bail
+		 * out; otherwise perl dies with "Wide Character" and takes the
+		 * backend down with it.
+		 */
+		if (sv_utf8_downgrade(sv, true))
+			val = SvPVbyte(sv, len);
+		else
+			elog(ERROR, "invalid byte sequence");
+	}
+
 	pg_verifymbstr(val, len, false);
 	return val;
 }
