On 7/15/22 14:19, Andres Freund wrote: > On 2022-07-15 14:01:53 -0700, Jacob Champion wrote: >> On 7/15/22 13:35, Andres Freund wrote: >>> I don't know, but I don't think not caring at all is a good >>> option. Particularly for unauthenticated data I'd say that escaping >>> everything >>> but printable ascii chars is a sensible approach. >> >> It'll also be painful for anyone whose infrastructure isn't in a Latin >> character set... Maybe that's worth the tradeoff for a v1. > > I don't think it's a huge issue, or really avoidable, pre-authentication. > Don't we require all server-side encodings to be supersets of ascii?
Well, I was going to say that for this feature, where the goal is to debug a failed certificate chain, having to manually unescape the logged certificate names if your infrastructure already handled UTF-8 natively would be a real pain. But your later point about truncation makes that moot; I forgot that my patch can truncate in the middle of a UTF-8 sequence, so you're probably dealing with replacement glyphs anyway. I don't really have a leg to stand on there. > We already have pg_clean_ascii() and use it for application_name, fwiw. That seems much worse than escaping for this particular patch; if your cert's Common Name is in (non-ASCII) UTF-8 then all you'll see is "CN=?????????" in the log lines that were supposed to be helping you root-cause. Escaping would be much more helpful in this case. >> Is there an acceptable approach that could centralize it, so we fix it >> once and are done? E.g. a log_encoding GUC and either conversion or >> escaping in send_message_to_server_log()? > > Introducing escaping to ascii for all log messages seems like it'd be > incredibly invasive, and would remove a lot of worthwhile information. Nor > does it really address the whole scope - consider e.g. the truncation in this > patch, that can't be done correctly by the time send_message_to_server_log() > is reached - just chopping in the middle of a multi-byte string would have > made the string invalidly encoded. And we can't perform encoding conversion > from client data until we've gone further into the authentication process, I > think. > > Always escaping ANSI escape codes (or rather the non-printable ascii range) is > more convincing. Then we'd just need to make sure that client controlled data > is properly encoded before handing it over to other parts of the system. > > I can see a point in a log_encoding GUC at some point, but it seems a bit > separate from the discussion here. Fair enough. --Jacob