Re: Add UTF8 support in PostgreSQL lookup table interface

Viktor Dukhovni Fri, 24 Aug 2018 11:56:03 -0700

> On Aug 24, 2018, at 2:33 PM, Andrew Sullivan <a...@anvilwalrusden.com> wrote:
> 
> On Thu, Aug 23, 2018 at 07:02:27PM -0400, Viktor Dukhovni wrote:
> 
>> Absent any indication of character set from the client, there was
>> no way to know what encoding any particular non-ASCII octet
>> string may be using, so the code was optimized to avoid spurious
>> database string conversion errors, by using an encoding that
>> would accept any octet-string, garbage-in -> garbage-out.
> 
> Unless I misunderstand you (and I might well be doing so) I don't
> think LATIN1 works that way in Postgres.


The point isn't really about Postgres per se, but rather that all
octet strings are valid encodings of some LATIN1 string, since
every octet is a valid LATIN1 code-point.  The same cannot be said
of UTF8, since a random octet string is generally not valid UTF8.
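
To make the codec-level claim concrete, here is a minimal sketch using
Python's built-in "latin-1" and "utf-8" codecs.  It only illustrates
the encoding argument; it says nothing about how Postgres itself
validates input, and the helper name decodes_as() is made up for this
example.

```python
def decodes_as(octets: bytes, encoding: str) -> bool:
    """Return True if the octet string is a valid encoding in `encoding`."""
    try:
        octets.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Every possible octet value, 0x00 through 0xFF.
all_octets = bytes(range(256))

# Each octet is a LATIN1 code-point, so the whole string decodes.
latin1_ok = decodes_as(all_octets, "latin-1")

# The same octets are not well-formed UTF8 (0x80 alone is a stray
# continuation byte, 0xFF never appears in UTF8 at all).
utf8_ok = decodes_as(all_octets, "utf-8")

# Decoding as LATIN1 and encoding again returns the original octets,
# which is the garbage-in -> garbage-out property.
round_trip = all_octets.decode("latin-1").encode("latin-1") == all_octets
```

With these definitions latin1_ok and round_trip are true, while
utf8_ok is false.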

> If you're going
> to get multibyte strings, I don't see how LATIN1 is any less likely to
> throw errors than UTF8

We don't get "multi-byte" strings.  Absent SMTPUTF8, we get octet
strings that are either ASCII or something else unspecified that
violates the SMTP protocol.  That unspecified something will be a
valid (intended or otherwise) LATIN1 string, so its use in database
queries with a LATIN1 client encoding will not throw perplexing errors.

> It's
> true that the characters in that case will probably map, but they'll
> fail anyway since the match could as easily be wrong as right).

The sending system violated the SMTP protocol, garbage-in -> garbage-out.
The folks who tend to be lazy about encodings have generally been the
ones with some flavour of a single-byte ISO-8859-X encoding, so we'll
continue to treat unspecified 8-bit input as LATIN1.

>> This means that we'd need a way to dynamically update the client
>> encoding of the database connection to UTF8 when appropriate
>> and revert it to LATIN1 when the client encoding is unspecified.
> 
> You can do this in libpq and also in commands passed on a regular connection:
> 
>    SET CLIENT_ENCODING TO 'value';

Yes, but that'd have to be done by the dictionary lookup layer,
possibly in proxymap, based on a suitable signal from the lookup
client.  The low-level API (dict_get()) does not presently support
any per-lookup flags, so we'd need a dict_get_ex() that takes a new
utf8 flag, plus supporting changes throughout the code.  This is a
major change.
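
As a rough illustration of the shape of that change, here is a sketch
in Python rather than in Postfix's C.  Everything in it is
hypothetical: FakeConnection stands in for a real database handle and
only records the statements it would send, the lookup table is a plain
dict, and this dict_get_ex() is just a model of the proposed call, not
an existing Postfix function.  The only piece taken from the thread is
the SET CLIENT_ENCODING statement.

```python
class FakeConnection:
    """Hypothetical stand-in for a database connection.

    It records statements instead of sending them anywhere.
    """

    def __init__(self):
        self.client_encoding = "LATIN1"  # default for unspecified input
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


def set_client_encoding(conn, encoding):
    """Issue SET CLIENT_ENCODING only when the encoding changes."""
    if conn.client_encoding != encoding:
        conn.execute("SET CLIENT_ENCODING TO '%s'" % encoding)
        conn.client_encoding = encoding


def dict_get_ex(conn, table, key, utf8):
    """Model of a per-lookup variant of dict_get() with a utf8 flag.

    utf8=True:  the lookup client vouches that the key is UTF8.
    utf8=False: the key is ASCII or unspecified 8-bit data, treated
                as LATIN1 so that no octet string is rejected.
    """
    set_client_encoding(conn, "UTF8" if utf8 else "LATIN1")
    text_key = key.decode("utf-8" if utf8 else "latin-1")
    return table.get(text_key)
```

A usage example under the same assumptions:

```python
conn = FakeConnection()
table = {"user@example.com": "local", "us\u00e9r@example.com": "intl"}

first = dict_get_ex(conn, table, b"user@example.com", utf8=False)
second = dict_get_ex(conn, table, "us\u00e9r@example.com".encode("utf-8"), utf8=True)
third = dict_get_ex(conn, table, b"user@example.com", utf8=False)
# conn.statements now holds one switch to UTF8 and one back to LATIN1.
```

Even in this toy form the flag has to travel from the lookup client
all the way down to the code that owns the connection, which is why
the real change touches so many layers.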

Perhaps there's a clever way to avoid this, but I am not seeing it yet.

-- 
        Viktor.
