Re: [HACKERS] Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

Robert Haas Thu, 09 Jun 2011 10:56:01 -0700

On Thu, Jun 9, 2011 at 1:22 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Hmm ... while the above is easy enough to do in the backend, where we
>>> can look at pg_database_encoding_max_length, we have also got instances
>>> of this coding pattern in src/port/pgstrcasecmp.c.  It's a lot less
>>> obvious how to make the test in frontend environments.  Thoughts anyone?
>
>> I'm not sure if this helps at all, but an awful lot of those tests are
>> against hard-coded strings that are known to contain only ASCII
>> characters.  Is there some way we can optimize this for that case?
>
> For the places where we're just looking for a match to a fixed all-ASCII
> string, an ASCII-only downcasing would be sufficient, and would
> eliminate the whole problem.  But I doubt all the callers fall into that
> class.
>
> What I'm particularly worried about at the moment is whether we are
> assuming anywhere that the frontend side can duplicate the backend's
> identifier downcasing behavior.  That seems like a complete morass,
> because (1) they might not have the same locale, (2) they might not
> have the same encoding, (3) even if they do, the "same" locale is known
> to behave differently on different platforms.


Right.  Understood.  So let's look at the cases (from git grep
pg_strcasecmp and pg_strncasecmp):

contrib/dict_int: Fixed strings only, and it's all backend code anyway.
contrib/dict_xsyn: Fixed strings only, and it's all backend code anyway.
contrib/hstore: Fixed strings only, and it's all backend code anyway.
contrib/pg_upgrade: Used to compare LC_COLLATE, LC_CTYPE, and encoding names.
contrib/pgbench: Definitely front-end code, but it's all fixed strings.
contrib/pgcrypto: All fixed strings except for one instance in
px_find_digest.  But it's all backend code anyway.
contrib/spi: One instance, not a fixed string, but it's backend code.
contrib/unaccent: One instance, not a fixed string, but it's backend code.
src/backend/*: Backend code, obviously.
src/bin/initdb: Strings from a constant lookup table
(tsearch_config_languages) only.
src/bin/pg_basebackup: Fixed strings only.
src/bin/pg_ctl: Fixed strings only.
src/bin/pg_dump: Fixed strings only.
src/bin/psql: Fixed strings only.  In a couple of cases they are not
constants - help.c uses strings from the generated file sql_help.h, and
tab-complete.c uses strings from a constant array called
words_after_create[].  But these are constant lookup tables.
src/include: access/reloptions.h uses pg_strncasecmp() as part of a
macro.  That should be OK as long as no one tries to include this in
frontend code, which seems rather impractical.
src/interfaces/ecpg/ecpglib: Fixed strings.
src/interfaces/ecpg/pgtypeslib: Fixed strings, and strings from a
constant lookup table, only.
src/interfaces/ecpg/preproc: This looks a bit worrisome.  It seems we
might be using it on identifiers here.
src/interfaces/libpq: This is attempting to match a wildcard
certificate name against a hostname, in two different places.
src/port/chklocale.c: Fixed strings or ones from a lookup table.
src/timezone/pgtz.c: Matches input strings against filenames read from the OS.

So mostly I think these are OK.  The instance in
src/interfaces/ecpg/preproc looks like the most likely candidate for a
problem spot.
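
For the fixed-string callers, the ASCII-only downcasing Tom mentions
could look roughly like the sketch below.  This is an illustration only,
not the code in src/port/pgstrcasecmp.c; the names ascii_tolower and
ascii_strcasecmp are hypothetical.  It assumes one side of the
comparison is a known all-ASCII string, so bytes with the high bit set
never need to be folded and are passed through untouched.

```c
#include <stddef.h>

/*
 * Hypothetical helper: fold only ASCII 'A'..'Z' to lower case.
 * Every other byte, including any byte >= 0x80 that may be part of a
 * multibyte character, is returned unchanged.  No locale is consulted.
 */
static unsigned char
ascii_tolower(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	return ch;
}

/*
 * Hypothetical ASCII-only case-insensitive comparison.  Returns a value
 * less than, equal to, or greater than zero, like strcmp().
 */
static int
ascii_strcasecmp(const char *s1, const char *s2)
{
	for (;;)
	{
		unsigned char ch1 = ascii_tolower((unsigned char) *s1++);
		unsigned char ch2 = ascii_tolower((unsigned char) *s2++);

		if (ch1 != ch2)
			return (int) ch1 - (int) ch2;
		if (ch1 == 0)
			return 0;
	}
}
```

Since nothing here depends on locale or encoding, frontend and backend
would get the same answer for these cases.  It does nothing for callers
that need to fold non-ASCII identifiers, which is where ecpg/preproc
would still need a closer look.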

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
