> On Aug 24, 2018, at 2:33 PM, Andrew Sullivan <a...@anvilwalrusden.com> wrote:
>
> On Thu, Aug 23, 2018 at 07:02:27PM -0400, Viktor Dukhovni wrote:
>
>> Absent any indication of character set from the client, there was
>> no way to know what encoding any particular non-ASCII octet
>> string may be using, so the code was optimized to avoid spurious
>> database string conversion errors, by using an encoding that
>> would accept any octet-string, garbage-in -> garbage-out.
>
> Unless I misunderstand you (and I might well be doing so) I don't
> think LATIN1 works that way in Postgres.

The point isn't really about Postgres per se, but rather that all
octet strings are valid encodings of some LATIN1 string, since every
octet is a valid LATIN1 code-point.  The same cannot be said of UTF8,
since random octet strings are generally not valid UTF8.

> If you're going
> to get multibyte strings, I don't see how LATIN1 is any less likely to
> throw errors than UTF8

We don't get "multi-byte" strings; absent SMTPUTF8 we get octet
strings that are either ASCII, or something else unspecified that
violates the SMTP protocol.  The something else unspecified will be
some valid (intended or otherwise) LATIN1 string.  Its use in
database queries with a LATIN1 client encoding will not throw
perplexing errors.

> It's
> true that the characters in that case will probably map, but they'll
> fail anyway since the match could as easily be wrong as right).

The sending system violated the SMTP protocol, garbage-in ->
garbage-out.  The folks who tend to be lazy about encodings have
generally been the ones with some flavour of a single-byte
ISO-8859-X encoding, so we'll continue to treat unspecified 8-bit
input as LATIN1.

>> This means that we'd need a way to dynamically update the client
>> encoding of the database connection to UTF8 when appropriate
>> and revert it to LATIN1 when the client encoding is unspecified.
>
> You can do this in libpq and also in commands passed on a regular connection:
>
>     SET CLIENT_ENCODING TO 'value';

Yes, but that'd have to be done by the dictionary lookup layer,
possibly in proxymap, based on a suitable signal from the lookup
client, but the low-level API (dict_get()) does not presently
support any per-lookup flags.  So we'd need a dict_get_ex() that
takes a new utf8 flag and supporting changes throughout the code.

This is a major change.  Perhaps there's a clever way to avoid this,
but I am not seeing it yet.

-- 
Viktor.
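
P.S. For anyone who wants to see the LATIN1 vs. UTF8 asymmetry
concretely, here is a minimal sketch.  It uses Python's own "latin-1"
and "utf-8" codecs as a stand-in, not the Postgres conversion
routines, and the helper name decodes_as() and the sample address are
made up for illustration.  It only shows that every octet sequence
decodes as LATIN1, while an arbitrary one generally does not decode
as UTF8.

```python
def decodes_as(octets: bytes, encoding: str) -> bool:
    """Return True if octets is a valid byte sequence in the given encoding."""
    try:
        octets.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Every possible single octet, 0x00 through 0xFF, in one string.
all_octets = bytes(range(256))

latin1_ok = decodes_as(all_octets, "latin-1")  # every octet is a code point
utf8_ok = decodes_as(all_octets, "utf-8")      # 0x80 alone is already invalid

# Hypothetical address sent as ISO-8859-1 by a client that declared no
# charset: 0xE9 is "e with acute accent" in LATIN1, but in UTF8 it is a
# lead byte that must be followed by continuation bytes.
legacy = b"caf\xe9@example.com"
legacy_latin1_ok = decodes_as(legacy, "latin-1")
legacy_utf8_ok = decodes_as(legacy, "utf-8")

# Decoding as LATIN1 and encoding back is lossless: garbage-in -> garbage-out,
# with no conversion error along the way.
round_trip = legacy.decode("latin-1").encode("latin-1") == legacy

print(latin1_ok, utf8_ok, legacy_latin1_ok, legacy_utf8_ok, round_trip)
```

Whether the LATIN1 reading is the one the sender intended is a
separate question; the sketch only covers the absence of decoding
errors.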