On Wed, Jan 25, 2017 at 03:31:36PM -0500, Viktor Dukhovni wrote:
> 
> The reason for LATIN1 is that all raw octet strings are valid LATIN1,
> so whatever non-ASCII garbage comes down the wire, database lookups
> won't tempfail with query encoding errors.  Absent mechanisms like
> SMTPUTF8 non-ASCII data in SMTP commands is undefined, and so no
> particular encoding of non-ASCII characters can be assumed.

Aha.  Yeah, I can see that.  The problem, of course, is that while
every octet string is valid LATIN1 (and every LATIN1 character has a
UTF-8 encoding), reading UTF-8 octets as LATIN1 doesn't give you the
same _characters_ the sender meant.  So it's not even possible to do
EAI with this setup by enforcing a limited subset of characters at
input time (which would otherwise be possible).

It strikes me that for this reason the C locale (with the SQL_ASCII
encoding) would be better than LATIN1 in Postgres.  At least in that
case you could use the raw data, although of course you could well
end up with garbage anyway.  But you wouldn't get an encoding error.
Or are you worried about a backend whose encoding is UTF8, so that
when you send the "unencoded" value you get an encoding error on
comparison?  I suppose that could be a problem (though I haven't
tested it -- this is why I use UTF8 for everything :).
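
Here's the sort of failure I have in mind, with Python standing in
for the backend's validation (I haven't tried this against a real
UTF8-encoded database):

    # 0xE9 is 'é' in LATIN1, but on its own it is not well-formed UTF-8,
    # which is roughly what a UTF8-encoded backend would reject:
    raw = b'\xe9'
    print(raw.decode('latin-1'))       # 'é', accepted without complaint
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as err:
        print('encoding error:', err)  # the analogue of an invalid-byte-sequence error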

> Even fancier would be dynamically adjusting the database encoding to
> UTF-8 when the client includes the "SMTPUTF8" ESMTP parameter in its
> "MAIL" command.  Since, presumably, in that case all non-ASCII data
> in the SMTP dialogue are then UTF-8 encoded (and can be validated
> as such before query construction).
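
If I follow you, that would look something like this (a hypothetical
sketch with made-up names, not a claim about how Postfix does it):

    def lookup_encoding(smtputf8, octets):
        """Hypothetical helper: pick the query encoding for one lookup."""
        if smtputf8:
            # SMTPUTF8 was announced, so non-ASCII data should be UTF-8;
            # validate it up front instead of tempfailing on the query.
            octets.decode('utf-8')    # raises UnicodeDecodeError if it isn't
            return 'UTF8'
        # Without SMTPUTF8 the encoding is undefined; fall back to LATIN1,
        # which accepts any raw octets.
        return 'LATIN1'

    print(lookup_encoding(True, 'josé@example.com'.encode('utf-8')))  # UTF8
    print(lookup_encoding(False, b'\xe9'))                            # LATIN1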

That validation could still fail, though, if you had different
versions of Unicode on the two systems, couldn't it?  (I'm imagining
the case where a later version of Unicode is on the mail system, but
it's talking to an older-Unicode database backend.  I _think_ the
mail server could still send code points that cause an encoding
error, even if the mail server did the validation.)

Thanks,

A

-- 
Andrew Sullivan
a...@anvilwalrusden.com
