The security team has a question below about how best to proceed with a recent
behavior change.

Commit 5dc1e42b4fa6a4434afa7d7cdcf0291351a7b873 for this week's CVE-2025-1094
changed how PQescapeString()[1] reacts to input that is not valid in the
client encoding.  Before that commit, the function would ignore encoding
problems except at the end of the string.  Now, it replaces the bad sequence
up to the length implied by the first byte.  For example, if UTF8 input has
0xc2 followed by an ASCII byte, the function removes both bytes.

Jeff Davis reported to the security team that
http://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences forbids something
like this, saying a UTF-8 converter "must not consume the [second byte] if it
continues".  (That's my summary, not his.  He might reply to add color here.)
While PQescapeString() is not a UTF-8 converter, that standard still felt to
multiple security team members like a decent argument for removing only the
invalid 0xc2, not the byte following it.  UTF8 is the most important encoding,
and other encodings tolerate this.  Either way, the server will report an
encoding error.  The difference doesn't have functional consequences if one
simply puts the function result in a query.  The difference could matter for
debugging or if applications are postprocessing the PQescapeString() result in
some way.  Postprocessing is not supported, but we'd still like to do the best
thing for applications that may already be doing it.

Security team members disagreed on whether next week's releases are the last
reasonable chance to change this, or whether changing it in e.g. May would be
reasonable.  If applications make changes to cope with the new behavior, that
could be an argument against further change.

Question for all: would you switch to the "remove fewer bytes" behavior in
next week's releases, switch later, or switch never?  Why so?  Please answer
in the next 24hr if possible, given the little time until we wrap next week's
releases on Monday.  I regret the late notice.

I'm attaching a WIP patch from Andres Freund.  We may use it to adopt the
"remove fewer bytes" behavior, if that's the decision.

Thanks,
nm


[1] The commit changed other functions, but PQescapeString() is most
interesting for this discussion.  Others have ways to report errors, or they
have reason to believe the input is already checked.  New code should be using
the others and checking the error indicator.
>From 6f0b93afb6a4ba1157482e674e71f56cd9c555c9 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 14 Feb 2025 18:31:15 -0500
Subject: [PATCH v3 2/2] Have escape functions process bytes after invalid
 multi-byte char

Reviewed-by: Jeff Davis <pg...@j-davis.com>
Backpatch: 13
---
 src/fe_utils/string_utils.c    | 40 ++++++++++++++++++----------------
 src/interfaces/libpq/fe-exec.c | 17 ++++++++-------
 2 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index b6a7b197087..8621856fbc1 100644
--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -206,14 +206,13 @@ fmtIdEnc(const char *rawid, int encoding)
                                 * "skip" over quote characters, e.g. when 
parsing
                                 * character-by-character.
                                 *
-                                * Replace the bytes corresponding to the 
invalid character
-                                * with an invalid sequence, for the same 
reason as above.
+                                * Replace the current byte with with an 
invalid sequence, for the
+                                * same reason as above.
                                 *
-                                * It would be a bit faster to verify the whole 
string the
-                                * first time we encounter a set highbit, but 
this way we can
-                                * replace just the invalid characters, which 
probably makes
-                                * it easier for users to find the invalidly 
encoded portion
-                                * of a larger string.
+                                * It would be a bit faster to verify the whole 
string the first
+                                * time we encounter a set highbit, but this 
way we can replace
+                                * just the invalid byte, which probably makes 
it easier for users
+                                * to find the invalidly encoded portion of a 
larger string.
                                 */
                                enlargePQExpBuffer(id_return, 2);
                                pg_encoding_set_invalid(encoding,
@@ -222,11 +221,13 @@ fmtIdEnc(const char *rawid, int encoding)
                                id_return->data[id_return->len] = '\0';
 
                                /*
-                                * Copy the rest of the string after the 
invalid multi-byte
-                                * character.
+                                * Handle the following bytes as if this byte 
didn't exist,
+                                * that's safer in case the subsequent bytes 
contain
+                                * characters that are significant for the 
caller (e.g. '>' in
+                                * html).
                                 */
-                               remaining -= charlen;
-                               cp += charlen;
+                               remaining -= 1;
+                               cp += 1;
                        }
                        else
                        {
@@ -421,23 +422,24 @@ appendStringLiteral(PQExpBuffer buf, const char *str,
                         * over quote characters, e.g. when parsing
                         * character-by-character.
                         *
-                        * Replace the bytes corresponding to the invalid 
character with
-                        * an invalid sequence, for the same reason as above.
+                        * Replace the current byte with with an invalid 
sequence, for the
+                        * same reason as above.
                         *
                         * It would be a bit faster to verify the whole string 
the first
                         * time we encounter a set highbit, but this way we can 
replace
-                        * just the invalid characters, which probably makes it 
easier for
-                        * users to find the invalidly encoded portion of a 
larger string.
+                        * just the invalid byte, which probably makes it 
easier for users
+                        * to find the invalidly encoded portion of a larger 
string.
                         */
                        pg_encoding_set_invalid(encoding, target);
                        target += 2;
-                       remaining -= charlen;
 
                        /*
-                        * Copy the rest of the string after the invalid 
multi-byte
-                        * character.
+                        * Handle the following bytes as if this byte didn't 
exist, that's
+                        * safer in case the subsequent bytes contain important 
characters
+                        * for the caller (e.g. '>' in html).
                         */
-                       source += charlen;
+                       remaining -= 1;
+                       source += 1;
                }
                else
                {
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 120d4d032ec..53b906f9562 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -4139,13 +4139,13 @@ PQescapeStringInternal(PGconn *conn,
                         * over quote characters, e.g. when parsing
                         * character-by-character.
                         *
-                        * Replace the bytes corresponding to the invalid 
character with
-                        * an invalid sequence, for the same reason as above.
+                        * Replace the current byte with with an invalid 
sequence, for the
+                        * same reason as above.
                         *
                         * It would be a bit faster to verify the whole string 
the first
                         * time we encounter a set highbit, but this way we can 
replace
-                        * just the invalid characters, which probably makes it 
easier for
-                        * users to find the invalidly encoded portion of a 
larger string.
+                        * just the invalid byte, which probably makes it 
easier for users
+                        * to find the invalidly encoded portion of a larger 
string.
                         */
                        if (error)
                                *error = 1;
@@ -4154,13 +4154,14 @@ PQescapeStringInternal(PGconn *conn,
 
                        pg_encoding_set_invalid(encoding, target);
                        target += 2;
-                       remaining -= charlen;
 
                        /*
-                        * Copy the rest of the string after the invalid 
multi-byte
-                        * character.
+                        * Handle the following bytes as if this byte didn't 
exist, that's
+                        * safer in case the subsequent bytes contain important 
characters
+                        * for the caller (e.g. '>' in html).
                         */
-                       source += charlen;
+                       remaining -= 1;
+                       source += 1;
                }
                else
                {
-- 
2.48.1.76.g4e746b1a31.dirty

Reply via email to