On Wed, Jan 25, 2017 at 03:31:36PM -0500, Viktor Dukhovni wrote:

> The reason for LATIN1 is that all raw octet strings are valid LATIN1,
> so whatever non-ASCII garbage comes down the wire, database lookups
> won't tempfail with query encoding errors.  Absent mechanisms like
> SMTPUTF8 non-ASCII data in SMTP commands is undefined, and so no
> particular encoding of non-ASCII characters can be assumed.
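To make that concrete, here is a small Python sketch of my own (not
anything from Postfix itself): any octet string at all decodes under
LATIN1, while the same octets can be rejected as UTF-8, which is what
would surface as a query encoding error against a UTF8-encoded
database.

    # 0xE9 is "é" in LATIN1, but on its own it is not a valid UTF-8
    # sequence.
    raw = b"exp\xe9dition@example.com"

    # Every possible octet string decodes under LATIN1 -- this never
    # raises, so a LATIN1 client/database encoding never tempfails.
    print(raw.decode("latin-1"))

    # The same octets are rejected as UTF-8; this is the failure a
    # UTF8-encoded database would report as an encoding error.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print("UTF-8 rejects these octets:", e)

    # Even where both decodes succeed, they need not agree: the LATIN1
    # characters re-encoded as UTF-8 are different octets.
    assert raw.decode("latin-1").encode("utf-8") != raw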
Aha.  Yeah, I can see that.  The problem, of course, is that while
every LATIN1 octet maps to a valid Unicode character, those octets are
not the same as the UTF-8 encoding of those characters.  So it's not
even possible to do EAI with this setup with just a limited subset of
characters enforced at input time (which would otherwise be possible).

It strikes me that for this reason the C locale would be better than
LATIN1 in Postgres.  At least in that case you could use the raw data,
although of course you could well end up with garbage anyway.  But you
wouldn't get an encoding error.  Or are you worried about a back end
whose encoding is UTF8, so that when you send the "unencoded" value
you get an encoding error on comparison?  I suppose that could be a
problem (though I haven't tested it -- this is why I use UTF8 for
everything :).

> Even fancier would be dynamically adjusting the database encoding to
> UTF-8 when the client includes the "SMTPUTF8" ESMTP parameter in its
> "MAIL" command.  Since, presumably, in that case all non-ASCII data
> in the SMTP dialogue are then UTF-8 encoded (and can be validated
> as such before query construction).

That validation could still fail, though, if you had different
versions of Unicode on the different systems, couldn't it?  (I'm
imagining the case where a later version of Unicode is on the mail
system, but it's talking to an older-Unicode database backend.)  I
_think_ the mail server could still send code points that cause an
encoding error, even if the mail server did the validation.

Thanks,

A

--
Andrew Sullivan
a...@anvilwalrusden.com
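P.S.  A rough sketch, purely my own illustration, of the "validate
before query construction" idea being discussed: treat the octets as
UTF-8 only when the client announced SMTPUTF8, and fall back to the
octets-as-LATIN1 behaviour otherwise.  The function name and flag here
are invented for the example, not anything that exists in Postfix.

    def make_lookup_key(raw: bytes, smtputf8: bool) -> str:
        if smtputf8:
            # Client promised UTF-8; validate it up front so a malformed
            # sequence is rejected here rather than tempfailing the
            # lookup against a UTF8-encoded database.
            return raw.decode("utf-8")   # raises UnicodeDecodeError
        # No SMTPUTF8: the octets have no defined encoding, so decode
        # as LATIN1, which can never fail, accepting that the
        # "characters" may be garbage.
        return raw.decode("latin-1")

    # The same octets behave differently under the two policies:
    addr = "expédition@example.com".encode("utf-8")
    print(make_lookup_key(addr, smtputf8=True))   # expédition@...
    print(make_lookup_key(addr, smtputf8=False))  # mojibake, no error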