On Dec 17, 2010, at 9:32 PM, David Christensen wrote:

> +1 on the original sentiment, but only for the case that we're dealing with
> data that is passed in/out as arguments. In the case that the
> server_encoding is UTF-8, this is as trivial as a few macros on the
> underlying SVs for text-like types. If the server_encoding is SQL_ASCII (=
> byte soup), this is a trivial case of doing nothing with the conversion
> regardless of data type. For any other server_encoding, the data would need
> to be converted from the server_encoding to UTF-8, presumably using the
> built-in conversions before passing it off to the first code path. A similar
> handling would need to be done for the return values, again
> datatype-dependent.
+1

> Recent upgrades of the Encode module included with perl 5.10+ have caused
> issues wherein circular dependencies between Encode and Encode::Alias have
> made it impossible to load in a Safe container without major pain. (There
> may be some better options than I'd had on a previous project, given that
> we're embedding our own interpreters and accessing more through the XS guts,
> so I'm not ruling out this possibility completely).

Fortunately, thanks to Tim Bunce, PL/Perl no longer relies on Safe.pm.

>> Well that works for me. I always use UTF8. Oleg, what was the encoding of
>> your database where you saw the issue?
>
> I'm not sure what the current plperl runtime does as far as marshaling this,
> but it would be fairly easy to ensure the parameters came in in perl's
> internal format given a server_encoding of UTF8 and some type introspection
> to identify the string-like types/text data. (Perhaps any type which had a
> binary cast to text would be a sufficient definition here. Do domains
> automatically inherit binary casts from their originating types?)

Their labels are TEXT. I believe that the only type that should not be
treated as text is bytea.

>>> 2) its not utf8, so we just leave it as octets.
>>
>> Which mean's Perl will assume that it's Latin-1, IIUC.
>
> This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed
> out earlier. This would produce bogus results for any non-UTF-8, non-ASCII,
> non latin-1 encoding, even if it did not generally bite most people in
> general usage.

Agreed.

> This example seems bogus; wouldn't length be 3 if this is the example text
> this was run with? Additionally, since all ASCII is trivially UTF-8, I think
> a better example would be using a string with hi-bit characters so if this
> was improperly handled the lengths wouldn't match; length($all_ascii) ==
> length(encode_utf8($all_ascii)) vs length($hi_bit) <
> length(encode_utf8($hi_bit)). I don't see that this test shows us much with
> the test case as given. The is_utf8() function merely returns the state of
> the SV_utf8 flag, which doesn't speak to UTF-8 validity (i.e., this need not
> be set on ascii-only strings, which are still valid in the UTF-8 encoding),
> nor does it indicate that there are no hi-bit characters in the string (i.e.,
> with encode_utf8($hi_bit_string)), the source string $hi_bit_string (in
> perl's internal format) with hi-bit characters will have the utf8 flag set,
> but the return value of encode_utf8 will not, even though the underlying
> data, as represented in perl will be identical).

Sorry, I probably had a pasto there. how about this?

    CREATE OR REPLACE FUNCTION perlgets(
        TEXT
    ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
       my $text = shift;
       return_next {
           length  => length $text,
           is_utf8 => utf8::is_utf8($text) ? 1 : 0
       };
    $$;

    utf8=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8
    ────────┼─────────
          7 │ t

    latin=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8
    ────────┼─────────
         11 │ f

(Yes I used Latin-1 curly quotes in that last example). I would argue that it
should output the same as the first example. That is, PL/Perl should have
decoded the latin-1 before passing the text to the Perl function.

>
>> In a latin-1 database:
>>
>> latin=# select * from perlgets('foo');
>>  length │ is_utf8
>> ────────┼─────────
>>       8 │ f
>> (1 row)
>>
>> I would argue that in the latter case, is_utf8 should be true, too. That is,
>> PL/Perl should decode from Latin-1 to Perl's internal form.
>
> See above for discussion of the is_utf8 flag; if we're dealing with latin-1
> data or (more precisely in this case) data that has not been decoded from the
> server_encoding to perl's internal format, this would exactly be the
> expectation for the state of that flag.

Right. I think that it *should* be decoded.
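
To make "decoded" concrete, here is a rough sketch in plain Perl of the behavior I have in mind. The helper name to_perl_string() and the way it takes an encoding name are assumptions for illustration only; this is not PL/Perl's actual marshaling code, and the real thing would presumably use the server's built-in conversions. The 7 vs. 11 in the comments comes from encoding the curly-quoted string as UTF-8, and is only meant to show character semantics versus octet semantics.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical helper (not part of PL/Perl): turn octets received from the
# server into a Perl character string, given an Encode-style encoding name.
sub to_perl_string {
    my ($octets, $server_encoding) = @_;

    # SQL_ASCII is byte soup: pass the octets through untouched.
    return $octets if $server_encoding eq 'SQL_ASCII';

    # Anything else: decode from the server encoding to Perl's internal form.
    return decode($server_encoding, $octets);
}

# The curly-quoted "hello" as characters, and the same text as UTF-8 octets.
my $chars  = "\x{201C}hello\x{201D}";
my $octets = encode_utf8($chars);

# Undecoded octets: length counts bytes, and the utf8 flag is off.
printf "octets:  length=%d is_utf8=%d\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0;    # length=11 is_utf8=0

# Decoded string: length counts characters, and the utf8 flag is on.
my $decoded = to_perl_string($octets, 'UTF-8');
printf "decoded: length=%d is_utf8=%d\n",
    length($decoded), utf8::is_utf8($decoded) ? 1 : 0;  # length=7 is_utf8=1
```

The SQL_ASCII branch is the "do nothing" case David described; a bytea argument would want the same pass-through treatment regardless of server_encoding.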
>> Interestingly, when I created a function that takes a bytea argument, utf8
>> was *still* enabled in the utf-8 database. That doesn't seem right to me.
>
> I'm not sure what you mean here, but I do think that if bytea is identifiable
> as one of the input types, we should do no encoding on the data itself, which
> would indicate that the utf8 flag for that variable would be unset.

Right.

> If this is not currently handled this way, I'd be a bit surprised, as bytea
> should just be an array of bytes with no character semantics attached to it.

It looks as though it is not handled that way. The utf8 flag *is* set on a
bytea string passed to a PL/Perl function in a UTF-8 database.

> As shown above, the character length for the example should be 27, while the
> octet length for the UTF-8 encoded version is 28. I've reviewed the source
> of URI::Escape, and can say definitively that: a) regular uri_escape does not
> handle > 255 code points in the encoding, but there exists a uri_escape_utf8
> which will convert the source string to UTF8 first and then escape the
> encoded value, and b) uri_unescape has *no* logic in it to automatically
> decode from UTF8 into perl's internal format (at least as far as the version
> that I'm looking at, which came with 5.10.1).

Right.

> -1; if you need to decode from an octets-only encoding, it's your
> responsibility to do so after you've unescaped it. Perhaps later versions of
> the URI::Escape module contain a uri_unescape_utf8() function, but it's
> trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}.
> This is definitely not a bug in uri_escape, as it is only defined to return
> octets.

Right, I think we're agreed on that count. I wouldn't mind seeing a
uri_unescape_utf8() though, as it might prevent some confusion.

>>> Yeah, the patch address this part. Right now we just spit out
>>> whatever the internal format happens to be.
>>
>> Ah, excellent.
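
Coming back to the uri_unescape_utf8() idea for a moment, here is a self-contained sketch of the one-liner quoted above. To avoid depending on URI::Escape being installed, unescape_octets() is a hypothetical stand-in that does what I understand uri_unescape() to do (replace each %XX escape with the corresponding octet); it is an assumption for illustration, not the module's code.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for URI::Escape's uri_unescape(): replace each %XX
# escape with the octet it names. The result is a string of octets.
sub unescape_octets {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $str;
}

# Hypothetical uri_unescape_utf8(): unescape to octets first, then decode
# those octets as UTF-8 into a Perl character string.
sub uri_unescape_utf8 {
    return decode_utf8(unescape_octets(shift));
}

# Curly-quoted "hello", UTF-8 encoded and then percent-escaped.
my $escaped = '%E2%80%9Chello%E2%80%9D';

my $octets = unescape_octets($escaped);    # still octets after unescaping
my $chars  = uri_unescape_utf8($escaped);  # characters after decoding

printf "octets: %d, characters: %d\n", length($octets), length($chars);
# octets: 11, characters: 7
```

The point is the same as above: unescaping alone gives you octets, and the caller (or a convenience wrapper like this one) has to do the decode.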
> I agree with the sentiments that: data (server_encoding) -> function
> parameters (-> perl internal) -> function return (-> server_encoding). This
> should be for any character-type data insofar as it is feasible, but ISTR
> there is already datatype-specific marshaling occurring.

Dunno about that.

> There is definitely a lot of confusion surrounding perl's handling of
> character data; I hope this was able to clear a few things up.

Yes, it helped, thanks!

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers