[moving to pgsql-hackers; this isn't the simple bug I initially suspected it might be]
On 20/09/10 03:10, Tom Lane wrote:
> Craig Ringer <cr...@postnewspapers.com.au> writes:
>> One of the correctly encoded messages is "Unexpected EOF received on
>> client connection"
>
>> One of the incorrectly encoded (shift-JIS) messages is: "Fast Shutdown
>> request received". Another is "Aborting any active transactions".
>
>> ... question now is where the messages are converted from UTF-8 to shift-JIS
>> and why that conversion is being applied inconsistently.
>
> Given those three examples, I wonder whether all the mis-encoded
> messages are emitted by the postmaster, rather than backends.
> Anyway it seems that you ought to look for some pattern in which
> messages are correctly vs incorrectly encoded.

I think you're right. Looking into it more, though, I'm not even sure what the correct behaviour even is. I don't think this is a simple bug where Pg fails to convert between encodings in a few places; rather, it's a design oversight where the effect of having a system encoding different from the encoding of the database(s) isn't considered.

A single log file should obviously be in a single encoding, it's the only sane way to do things. But which encoding is it in? And which *should* it be in?

- The system text encoding? This is what the postmaster will have from its environment, and is what the user will expect the logs to be in. Postmaster will emit messages in this encoding at least during startup, as it doesn't know what encoding the cluster uses yet. (In fact it seems to stick to the system encoding throughout its life).

- The default database encoding supplied to initdb during cluster creation?

- The encoding of the database emitting a message? This makes sense when considering RAISE messages, for example. Backends will currently use this encoding when emitting log messages, whether user-supplied or translated from po files.

This confusion leads to the mixed encoding issues reported by the OP. It's not a simple bug, it's a design issue.
Unfortunately, it's not as simple as picking one of the above encodings for all logging.

The system encoding isn't a good choice, because it might not be capable of representing all characters emitted by user RAISE statements in databases with a different encoding, nor all "double quoted" identifiers, parameter values, etc etc etc. For example, if the system encoding is SHIFT-JIS, but user databases emit messages with Chinese, Cyrillic, extended latin, or pretty much any non-Japanese characters, there's no sane way to convert messages containing any user text to shift-JIS for logging. The same applies with a latin-1 (iso-8859-1) system encoding and a utf-8 or shift-jis database emitting Japanese messages. Scratch using the system encoding for logging.

What about the encoding used by initdb to create the cluster? It seems sensible, but:

- The postmaster doesn't know what it is when it's doing its initial startup. How can the postmaster complain that it can't find / open the cluster datadir when it doesn't know what encoding to use for the complaint?

- If the cluster isn't created as utf-8, the same issue as with the system encoding applies.

Using the encoding of the emitting database will permit all messages to be represented, but will give rise to mixed encodings in the log file, and still won't help the postmaster know what to do before it's found and read the cluster.

I'm now inclined to propose that all logging be done unconditionally in utf-8, with a BOM written to the start of every log file. Backends with non-utf-8 databases should convert messages to utf-8 for logging. Because PostgreSQL supports the use of different encodings in different databases this is the only way to ensure sane logging with consistent encoding in a single log file.

The only alternative I see is to break logging out into separate files:

- postmaster.log for postmaster etc
- [databasename].log for each database, in that database's encoding

...
but I'm not confident that'd be worth the confusion.

Neither scheme solves the question of what to do when logging to syslog, though. Syslog expects messages in the system encoding, and Pg would be wrong to log in any other encoding. Yet as databases may have characters that cannot be represented in the system encoding, the system encoding isn't good enough. Should syslog messages be converted to the system encoding with non-representable characters replaced by "?" or some other placeholder? Blech.

Ideas? Suggestions?

--
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/