On Sat, May 11, 2019 at 10:43:59PM -0400, Tom Lane wrote: > Noah Misch <n...@leadboat.com> writes: > > Pushed. This broke 010_dump_connstr.pl on bowerbird, introducing 'invalid > > byte sequence for encoding "UTF8"' errors. That's because log_connections > > renders this 010_dump_connstr.pl solution insufficient: > > Ugh. > > > 4. If GetMessageEncoding()==PG_SQL_ASCII, make pgwin32_message_to_UTF16() > > return NULL. The caller will always send untranslated bytes to write() > > or > > ReportEventA(). This seems consistent with the SQL_ASCII concept and > > with > > pg_do_encoding_conversion()'s interpretation of SQL_ASCII. > > > 5. When including a datname or rolname value in a message, hex-escape > > non-ASCII bytes. They are byte sequences, not text of known encoding. > > This preserves the most information, but it's overkill and ugly in the > > probably-common case of one encoding across all databases of a cluster. > > > I'm inclined to do (1) in back branches and (4) in HEAD only. (If starting > > fresh today, I would store the encoding of each rolname and dbname or just > > use > > UTF8 for those particular fields.) Other preferences? > > I agree that (4) is a fairly reasonable thing to do, and wouldn't mind > back-patching that.
Okay. Absent objections, I'll just do it that way. > Taking a wider view, this seems closely related > to something I've been thinking about in connection with the recent > pg_stat_activity contretemps: that mechanism is also shoving strings > across database boundaries without a lot of worry about encodings. > Maybe we should try to develop a common solution. > > One difference from the datname/rolname situation is that for > pg_stat_activity we can know the source encoding --- we aren't storing > it now, but we easily could. If we're thinking of a future solution > only, adding a "name encoding" field to relevant shared catalogs makes > sense perhaps. Alternatively, requiring names in shared catalogs to be > UTF8 might be a reasonable answer too. > > In all these cases, throwing an error when we can't translate a character > into the destination encoding is not very pleasant. For pg_stat_activity, > I was imagining that translating such characters to '?' might be the best > answer. I don't know if we can get away with that for the datname/rolname > case --- at the very least, it opens problems with apparent duplication of > names that should be unique. I don't much like your hex-encoding answer, > though; that has its own uniqueness-violation hazards, plus it's ugly. Another case of byte sequence masquerading as text is pg_settings.setting. In most contexts, it's important to convey exact values. Error messages can use '?'. I wouldn't let dump/reload of a rolname corrupt it that way, and I wouldn't recognize the '?' version for authentication. While pg_stat_activity.query could use '?', I'd encourage adding bytea and encoding columns for exact transmission. pg_stat_activity can't standardize on UTF8 without shrinking the set of valid queries or inaccurately reporting some, neither of which is attractive. datname/rolname could afford to be more prescriptive, since non-ASCII names are full of bugs today. A useful consequence of UTF8 datname/rolname would be today's "pg_dumpall --globals" remaining simple. If we were to support arbitrary encodings with a "name encoding" field, the general-case equivalent of "pg_dumpall --globals" would connect to several databases of different encodings in order to dump all objects, perhaps even creating a temporary database if no suitable-encoding database existed. MULE_INTERNAL presents trouble since we don't have a UTF8<->MULE_INTERNAL conversion. If we standardized cross-database strings on UTF8, it would be impossible to read such strings, create roles, etc. from a MULE_INTERNAL database. I suppose we'd either add the conversion or deprecate MULE_INTERNAL, forbidding its use as the initdb encoding.