Re: [HACKERS] [GENERAL] psql weird behaviour with charset encodings

Michael Paquier Tue, 02 Jun 2015 00:20:27 -0700

On Sun, May 24, 2015 at 2:43 AM, Noah Misch <[email protected]> wrote:
> On Sat, May 08, 2010 at 09:24:45PM -0400, Tom Lane wrote:
>> [email protected] writes:
>> > http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
>>
>> > The last explains why they do not consider it a bug:
>>
>> > ISO C99 requires for %.*s to only write complete characters that fit
below
>> > the
>> > precision number of bytes. If you are using say UTF-8 locale, but
ISO-8859-1
>> > characters as shown in the input file you provided, some of the
strings are
>> > not valid UTF-8 strings, therefore sprintf fails with -1 because of the
>> > encoding error. That's not a bug in glibc.
>>
>> Yeah, that was about the position I thought they'd take.
>
> GNU libc eventually revisited that conclusion and fixed the bug through
commit
> 715a900c9085907fa749589bf738b192b1a2bda5.  RHEL 7.1 is fixed, but RHEL
6.6 and
> RHEL 5.11 are still affected; the bug will be relevant for another 8+
years.
>
>> So the bottom line here is that we're best off to avoid %.*s because
>> it may fail if the string contains data that isn't validly encoded
>> according to libc's idea of the prevailing encoding.
>
> Yep.  Immediate precisions like %.10s trigger the bug as effectively as
%.*s,
> so tarCreateHeader() [_tarWriteHeader() in 9.2 and earlier] is also
affected.
> Switching to strlcpy(), as attached, fixes the bug while simplifying the
code.
> The bug symptom is error 'pg_basebackup: unrecognized link indicator "0"'
when
> the name of a file in the data directory is not a valid multibyte string.
>
>
> Commit 6dd9584 introduced a new use of .*s, to pg_upgrade.  It works
reliably
> for now, because it always runs in the C locale.  pg_upgrade never calls
> set_pglocale_pgservice() or otherwise sets its permanent locale.  It
would be
> natural for us to fix that someday, at which point non-ASCII database
names
> would perturb this status output.


I caught up the following places that need attention on top of the 4 ones
in tar.c:
src/backend/nodes/read.c:
    elog(ERROR, "unrecognized integer: \"%.*s\"",
src/backend/nodes/read.c:
    elog(ERROR, "unrecognized OID: \"%.*s\"",
src/backend/nodes/read.c:                               elog(ERROR,
"unrecognized token: \"%.*s\"", tok_len, token);
src/backend/nodes/readfuncs.c:          elog(ERROR, "unrecognized token:
\"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:          elog(ERROR, "unrecognized token:
\"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:                  elog(ERROR, "unrecognized
integer: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:          elog(ERROR, "unrecognized boolop
\"%.*s\"", length, token);
src/backend/nodes/readfuncs.c:          elog(ERROR, "badly formatted node
string \"%.32s\"...", token);
src/backend/tsearch/wparser_def.c:       * Use of %.*s here is a bit risky
since it can misbehave if the data is
src/backend/tsearch/wparser_def.c:      fprintf(stderr, "parsing
\"%.*s\"\n", len, str);
src/backend/tsearch/wparser_def.c:      /* See note above about %.*s */
src/backend/tsearch/wparser_def.c:      fprintf(stderr, "parsing copy of
\"%.*s\"\n", prs->lenstr, prs->str);
src/backend/utils/adt/datetime.c:                        * Note: the uses
of %.*s in this function would be risky if the
src/backend/utils/adt/datetime.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:                       /* %.*s is safe
since all our tokens are ASCII */
src/backend/utils/adt/datetime.c:                       elog(LOG, "token
too long in %s table: \"%.*s\"",
src/interfaces/ecpg/ecpglib/error.c:    /* %.*s is safe here as long as
sqlstate is all-ASCII */
src/interfaces/ecpg/ecpglib/error.c:    ecpg_log("raising sqlstate %.*s
(sqlcode %ld): %s\n",
src/interfaces/ecpg/pgtypeslib/dt_common.c:                      * Note:
the uses of %.*s in this function would be risky if the
src/interfaces/ecpg/pgtypeslib/dt_common.c:
    sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c:
    sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c:
    sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/bin/pg_basebackup/pg_basebackup.c:
 ngettext("%*s/%s kB (%d%%), %d/%d tablespace (%s%-*.*s)",
src/bin/pg_basebackup/pg_basebackup.c:
              "%*s/%s kB (%d%%), %d/%d tablespaces (%s%-*.*s)",
src/bin/pg_upgrade/util.c:                              printf("
 %s%-*.*s\r",

> It would be good to purge the code of precisions on "s" conversion
specifiers,
> then Assert(!pointflag) in fmtstr() to catch new introductions.  I won't
plan
> to do it myself, but it would be a nice little defensive change.

This sounds like a good protection idea, but as it impacts existing backend
code relying in sprintf's port version we should only do the assertion in
HEAD in my opinion, and mention it in the release notes of the next major
version at the time a patch in this area is applied. I guess that we had
better backpatch the places using .*s though on back-branches.
-- 
Michael

Re: [HACKERS] [GENERAL] psql weird behaviour with charset encodings

Reply via email to