Unicode escapes with any backend encoding

Tom Lane Mon, 13 Jan 2020 15:32:59 -0800

I threatened to do this in another thread [1], so here it is.

This patch removes the restriction that the server encoding must
be UTF-8 in order to write any Unicode escape with a value outside
the ASCII range.  Instead, we'll allow the notation and convert to
the server encoding if that's possible.  (If it isn't, of course
you get an encoding conversion failure.)


In the cases that were already supported, namely ASCII characters
or UTF-8 server encoding, this should be only immeasurably slower
than before.  Otherwise, it calls the appropriate encoding conversion
procedure, which of course will take a little time.  But that's
better than failing, surely.
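
As an illustration of the three paths described above (ASCII, UTF-8 server encoding, everything else), here is a minimal standalone sketch. It is not code from the patch. The names demo_encoding, demo_to_utf8 and demo_unicode_to_server are hypothetical, and LATIN1 stands in for "some other server encoding" only because its mapping is trivial (code points U+0080..U+00FF map to the same byte value). The real patch calls the catalog-registered conversion procedure instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the server encoding; not PostgreSQL's pg_enc. */
typedef enum
{
	DEMO_UTF8,
	DEMO_LATIN1
} demo_encoding;

/* Encode one code point as UTF-8; returns the number of bytes written. */
static size_t
demo_to_utf8(uint32_t c, unsigned char *s)
{
	if (c <= 0x7F)
	{
		s[0] = (unsigned char) c;
		return 1;
	}
	if (c <= 0x7FF)
	{
		s[0] = (unsigned char) (0xC0 | (c >> 6));
		s[1] = (unsigned char) (0x80 | (c & 0x3F));
		return 2;
	}
	if (c <= 0xFFFF)
	{
		s[0] = (unsigned char) (0xE0 | (c >> 12));
		s[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
		s[2] = (unsigned char) (0x80 | (c & 0x3F));
		return 3;
	}
	s[0] = (unsigned char) (0xF0 | (c >> 18));
	s[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
	s[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
	s[3] = (unsigned char) (0x80 | (c & 0x3F));
	return 4;
}

/*
 * Convert code point "c" to a NUL-terminated string in encoding "enc".
 * "s" must have room for at least 5 bytes.  Returns 0 on success, -1 if
 * the code point is invalid or has no equivalent in the target encoding
 * (the real code reports an error at that point instead).
 */
static int
demo_unicode_to_server(uint32_t c, demo_encoding enc, unsigned char *s)
{
	if (c == 0 || c > 0x10FFFF)
		return -1;				/* invalid Unicode code point */

	/* Fast path 1: ASCII is the same in every supported encoding */
	if (c <= 0x7F)
	{
		s[0] = (unsigned char) c;
		s[1] = '\0';
		return 0;
	}

	/* Fast path 2: UTF-8 server encoding only needs reformatting */
	if (enc == DEMO_UTF8)
	{
		s[demo_to_utf8(c, s)] = '\0';
		return 0;
	}

	/* Slow path: a real conversion; here only the trivial LATIN1 mapping */
	if (c <= 0xFF)
	{
		s[0] = (unsigned char) c;
		s[1] = '\0';
		return 0;
	}
	return -1;					/* no equivalent in the target encoding */
}
```

With this sketch, U+00E9 becomes the single byte 0xE9 under DEMO_LATIN1 and the two bytes 0xC3 0xA9 under DEMO_UTF8, while U+20AC fails under DEMO_LATIN1, which corresponds to the encoding conversion failure mentioned earlier.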

One way in which this is slightly less good than before is that
you no longer get a syntax error cursor pointing at the problematic
escape when conversion fails.  If we were really excited about that,
something could be done with setting up an errcontext stack entry.
But that would add a few cycles, so I wasn't sure whether to do it.
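
To make the errcontext idea concrete, here is a toy standalone model of a context-callback stack. It is not PostgreSQL code and does not use the real error machinery: every demo_* name is hypothetical, and the struct only imitates the previous/callback/arg layout of the backend's ErrorContextCallback. The point is to show where the extra cycles would go: a push and a pop around each conversion, with the callback running only when an error is actually reported.

```c
#include <stdio.h>
#include <string.h>

/* Toy model of an error-context stack entry (hypothetical, not backend code). */
typedef struct demo_context_callback
{
	struct demo_context_callback *previous;
	void		(*callback) (void *arg, char *msg, size_t msglen);
	void	   *arg;
} demo_context_callback;

static demo_context_callback *demo_context_stack = NULL;

/* Context callback: append the position of the escape to the message. */
static void
demo_escape_context(void *arg, char *msg, size_t msglen)
{
	int			pos = *(int *) arg;
	size_t		used = strlen(msg);

	snprintf(msg + used, msglen - used, " (escape at position %d)", pos);
}

/* Build an error message, letting every stacked callback add its context. */
static void
demo_report_error(const char *text, char *msg, size_t msglen)
{
	demo_context_callback *cb;

	snprintf(msg, msglen, "%s", text);
	for (cb = demo_context_stack; cb != NULL; cb = cb->previous)
		cb->callback(cb->arg, msg, msglen);
}

/*
 * Pretend to convert the escape found at "pos".  For the demo, anything
 * outside ASCII is treated as unconvertible.  Returns 1 on success, 0 on
 * failure (with the message written to "msg").
 */
static int
demo_convert_escape(unsigned int c, int pos, char *msg, size_t msglen)
{
	demo_context_callback entry;
	int			ok = 1;

	/* Push a context entry: this is the per-escape overhead */
	entry.previous = demo_context_stack;
	entry.callback = demo_escape_context;
	entry.arg = &pos;
	demo_context_stack = &entry;

	if (c > 0x7F)
	{
		demo_report_error("conversion failed", msg, msglen);
		ok = 0;
	}

	/* Pop the entry again */
	demo_context_stack = entry.previous;
	return ok;
}
```

In this model a failing call such as demo_convert_escape(0x20AC, 7, msg, sizeof(msg)) leaves "conversion failed (escape at position 7)" in msg, which is roughly the information the lost error cursor used to convey.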

Grepping for other direct uses of unicode_to_utf8(), I notice that
there are a couple of places in the JSON code where we have a similar
restriction that you can only write a Unicode escape in UTF8 server
encoding.  I'm not sure whether these same semantics could be
applied there, so I didn't touch that.

Thoughts?

                        regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CACPNZCvaoa3EgVWm5yZhcSTX6RAtaLgniCPcBVOCwm8h3xpWkw%40mail.gmail.com

diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index c908e0b..e134877 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -189,6 +189,23 @@ UPDATE "my_table" SET "a" = 5;
     ampersands.  The length limitation still applies.
    </para>
 
+   <para>
+    Quoting an identifier also makes it case-sensitive, whereas
+    unquoted names are always folded to lower case.  For example, the
+    identifiers <literal>FOO</literal>, <literal>foo</literal>, and
+    <literal>"foo"</literal> are considered the same by
+    <productname>PostgreSQL</productname>, but
+    <literal>"Foo"</literal> and <literal>"FOO"</literal> are
+    different from these three and each other.  (The folding of
+    unquoted names to lower case in <productname>PostgreSQL</productname> is
+    incompatible with the SQL standard, which says that unquoted names
+    should be folded to upper case.  Thus, <literal>foo</literal>
+    should be equivalent to <literal>"FOO"</literal> not
+    <literal>"foo"</literal> according to the standard.  If you want
+    to write portable applications you are advised to always quote a
+    particular name or never quote it.)
+   </para>
+
    <indexterm>
      <primary>Unicode escape</primary>
      <secondary>in identifiers</secondary>
@@ -230,7 +247,8 @@ U&amp;"d!0061t!+000061" UESCAPE '!'
     The escape character can be any single character other than a
     hexadecimal digit, the plus sign, a single quote, a double quote,
     or a whitespace character.  Note that the escape character is
-    written in single quotes, not double quotes.
+    written in single quotes, not double quotes,
+    after <literal>UESCAPE</literal>.
    </para>
 
    <para>
@@ -239,32 +257,18 @@ U&amp;"d!0061t!+000061" UESCAPE '!'
    </para>
 
    <para>
-    The Unicode escape syntax works only when the server encoding is
-    <literal>UTF8</literal>.  When other server encodings are used, only code
-    points in the ASCII range (up to <literal>\007F</literal>) can be
-    specified.  Both the 4-digit and the 6-digit form can be used to
+    Either the 4-digit or the 6-digit escape form can be used to
     specify UTF-16 surrogate pairs to compose characters with code
     points larger than U+FFFF, although the availability of the
     6-digit form technically makes this unnecessary.  (Surrogate
-    pairs are not stored directly, but combined into a single
-    code point that is then encoded in UTF-8.)
+    pairs are not stored directly, but are combined into a single
+    code point.)
    </para>
 
    <para>
-    Quoting an identifier also makes it case-sensitive, whereas
-    unquoted names are always folded to lower case.  For example, the
-    identifiers <literal>FOO</literal>, <literal>foo</literal>, and
-    <literal>"foo"</literal> are considered the same by
-    <productname>PostgreSQL</productname>, but
-    <literal>"Foo"</literal> and <literal>"FOO"</literal> are
-    different from these three and each other.  (The folding of
-    unquoted names to lower case in <productname>PostgreSQL</productname> is
-    incompatible with the SQL standard, which says that unquoted names
-    should be folded to upper case.  Thus, <literal>foo</literal>
-    should be equivalent to <literal>"FOO"</literal> not
-    <literal>"foo"</literal> according to the standard.  If you want
-    to write portable applications you are advised to always quote a
-    particular name or never quote it.)
+    If the server encoding is not UTF-8, the Unicode code point identified
+    by one of these escape sequences is converted to the actual server
+    encoding; an error is reported if that's not possible.
    </para>
   </sect2>
 
@@ -427,25 +431,11 @@ SELECT 'foo'      'bar';
     <para>
      It is your responsibility that the byte sequences you create,
      especially when using the octal or hexadecimal escapes, compose
-     valid characters in the server character set encoding.  When the
-     server encoding is UTF-8, then the Unicode escapes or the
+     valid characters in the server character set encoding.
+     A useful alternative is to use Unicode escapes or the
      alternative Unicode escape syntax, explained
-     in <xref linkend="sql-syntax-strings-uescape"/>, should be used
-     instead.  (The alternative would be doing the UTF-8 encoding by
-     hand and writing out the bytes, which would be very cumbersome.)
-    </para>
-
-    <para>
-     The Unicode escape syntax works fully only when the server
-     encoding is <literal>UTF8</literal>.  When other server encodings are
-     used, only code points in the ASCII range (up
-     to <literal>\u007F</literal>) can be specified.  Both the 4-digit and
-     the 8-digit form can be used to specify UTF-16 surrogate pairs to
-     compose characters with code points larger than U+FFFF, although
-     the availability of the 8-digit form technically makes this
-     unnecessary.  (When surrogate pairs are used when the server
-     encoding is <literal>UTF8</literal>, they are first combined into a
-     single code point that is then encoded in UTF-8.)
+     in <xref linkend="sql-syntax-strings-uescape"/>; then the server
+     will check that the character conversion is possible.
     </para>
 
     <caution>
@@ -524,16 +514,23 @@ U&amp;'d!0061t!+000061' UESCAPE '!'
     </para>
 
     <para>
-     The Unicode escape syntax works only when the server encoding is
-     <literal>UTF8</literal>.  When other server encodings are used, only
-     code points in the ASCII range (up to <literal>\007F</literal>)
-     can be specified.  Both the 4-digit and the 6-digit form can be
-     used to specify UTF-16 surrogate pairs to compose characters with
-     code points larger than U+FFFF, although the availability of the
-     6-digit form technically makes this unnecessary.  (When surrogate
-     pairs are used when the server encoding is <literal>UTF8</literal>, they
-     are first combined into a single code point that is then encoded
-     in UTF-8.)
+     To include the escape character in the string literally, write
+     it twice.
+    </para>
+
+    <para>
+     Either the 4-digit or the 6-digit escape form can be used to
+     specify UTF-16 surrogate pairs to compose characters with code
+     points larger than U+FFFF, although the availability of the
+     6-digit form technically makes this unnecessary.  (Surrogate
+     pairs are not stored directly, but are combined into a single
+     code point.)
+    </para>
+
+    <para>
+     If the server encoding is not UTF-8, the Unicode code point identified
+     by one of these escape sequences is converted to the actual server
+     encoding; an error is reported if that's not possible.
     </para>
 
     <para>
@@ -546,11 +543,6 @@ U&amp;'d!0061t!+000061' UESCAPE '!'
      parameter is set to off, this syntax will be rejected with an
      error message.
     </para>
-
-    <para>
-     To include the escape character in the string literally, write it
-     twice.
-    </para>
    </sect3>
 
    <sect3 id="sql-syntax-dollar-quoting">
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 1bf1144..e88a5e0 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -292,7 +292,7 @@ hexval(unsigned char c)
 	return 0;					/* not reached */
 }
 
-/* is Unicode code point acceptable in database's encoding? */
+/* is Unicode code point acceptable? */
 static void
 check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
 {
@@ -302,12 +302,6 @@ check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("invalid Unicode escape value"),
 				 scanner_errposition(pos, yyscanner)));
-
-	if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
-		ereport(ERROR,
-				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
-				 scanner_errposition(pos, yyscanner)));
 }
 
 /* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
@@ -338,18 +332,30 @@ str_udeescape(const char *str, char escape,
 	const char *in;
 	char	   *new,
 			   *out;
+	size_t		new_len;
 	pg_wchar	pair_first = 0;
 
 	/*
-	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
-	 * longer than its escaped representation.
+	 * Guesstimate that result will be no longer than input, but allow enough
+	 * padding for Unicode conversion.
 	 */
-	new = palloc(strlen(str) + 1);
+	new_len = strlen(str) + MAX_UNICODE_EQUIVALENT_STRING + 1;
+	new = palloc(new_len);
 
 	in = str;
 	out = new;
 	while (*in)
 	{
+		/* Enlarge string if needed */
+		size_t		out_dist = out - new;
+
+		if (out_dist > new_len - (MAX_UNICODE_EQUIVALENT_STRING + 1))
+		{
+			new_len *= 2;
+			new = repalloc(new, new_len);
+			out = new + out_dist;
+		}
+
 		if (in[0] == escape)
 		{
 			if (in[1] == escape)
@@ -390,8 +396,8 @@ str_udeescape(const char *str, char escape,
 					pair_first = unicode;
 				else
 				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
+					pg_unicode_to_server(unicode, (unsigned char *) out);
+					out += strlen(out);
 				}
 				in += 5;
 			}
@@ -431,8 +437,8 @@ str_udeescape(const char *str, char escape,
 					pair_first = unicode;
 				else
 				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
+					pg_unicode_to_server(unicode, (unsigned char *) out);
+					out += strlen(out);
 				}
 				in += 8;
 			}
@@ -457,13 +463,6 @@ str_udeescape(const char *str, char escape,
 		goto invalid_pair;
 
 	*out = '\0';
-
-	/*
-	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-	 * codes; but it's probably not worth the trouble, since this isn't likely
-	 * to be a performance-critical path.
-	 */
-	pg_verifymbstr(new, out - new, false);
 	return new;
 
 invalid_pair:
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 84c7391..3903df8 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -1226,19 +1226,18 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 static void
 addunicode(pg_wchar c, core_yyscan_t yyscanner)
 {
-	char		buf[8];
+	char		buf[MAX_UNICODE_EQUIVALENT_STRING + 1];
 
 	/* See also check_unicode_value() in parser.c */
 	if (c == 0 || c > 0x10FFFF)
 		yyerror("invalid Unicode escape value");
-	if (c > 0x7F)
-	{
-		if (GetDatabaseEncoding() != PG_UTF8)
-			yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
-		yyextra->saw_non_ascii = true;
-	}
-	unicode_to_utf8(c, (unsigned char *) buf);
-	addlit(buf, pg_mblen(buf), yyscanner);
+
+	/*
+	 * We expect that pg_unicode_to_server() will complain about any
+	 * unconvertible code point, so we don't have to set saw_non_ascii.
+	 */
+	pg_unicode_to_server(c, (unsigned char *) buf);
+	addlit(buf, strlen(buf), yyscanner);
 }
 
 static unsigned char
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 3808c30..a2d2a0b 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -2086,26 +2086,6 @@ map_sql_identifier_to_xml_name(const char *ident, bool fully_escaped,
 
 
 /*
- * Map a Unicode codepoint into the current server encoding.
- */
-static char *
-unicode_to_sqlchar(pg_wchar c)
-{
-	char		utf8string[8];	/* need room for trailing zero */
-	char	   *result;
-
-	memset(utf8string, 0, sizeof(utf8string));
-	unicode_to_utf8(c, (unsigned char *) utf8string);
-
-	result = pg_any_to_server(utf8string, strlen(utf8string), PG_UTF8);
-	/* if pg_any_to_server didn't strdup, we must */
-	if (result == utf8string)
-		result = pstrdup(result);
-	return result;
-}
-
-
-/*
  * Map XML name to SQL identifier; see SQL/XML:2008 section 9.3.
  */
 char *
@@ -2125,10 +2105,12 @@ map_xml_name_to_sql_identifier(const char *name)
 			&& isxdigit((unsigned char) *(p + 5))
 			&& *(p + 6) == '_')
 		{
+			char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
 			unsigned int u;
 
 			sscanf(p + 2, "%X", &u);
-			appendStringInfoString(&buf, unicode_to_sqlchar(u));
+			pg_unicode_to_server(u, (unsigned char *) cbuf);
+			appendStringInfoString(&buf, cbuf);
 			p += 6;
 		}
 		else
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 5d7cc74..7d90ac9 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -68,6 +68,13 @@ static FmgrInfo *ToServerConvProc = NULL;
 static FmgrInfo *ToClientConvProc = NULL;
 
 /*
+ * This variable stores the conversion function to convert from UTF-8
+ * to the server encoding.  It's NULL if the server encoding *is* UTF-8,
+ * or if we lack a conversion function for this.
+ */
+static FmgrInfo *Utf8ToServerConvProc = NULL;
+
+/*
  * These variables track the currently-selected encodings.
  */
 static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII];
@@ -273,6 +280,8 @@ SetClientEncoding(int encoding)
 void
 InitializeClientEncoding(void)
 {
+	int			current_server_encoding;
+
 	Assert(!backend_startup_complete);
 	backend_startup_complete = true;
 
@@ -289,6 +298,35 @@ InitializeClientEncoding(void)
 						pg_enc2name_tbl[pending_client_encoding].name,
 						GetDatabaseEncodingName())));
 	}
+
+	/*
+	 * Also look up the UTF8-to-server conversion function if needed.  Since
+	 * the server encoding is fixed within any one backend process, we don't
+	 * have to do this more than once.
+	 */
+	current_server_encoding = GetDatabaseEncoding();
+	if (current_server_encoding != PG_UTF8 &&
+		current_server_encoding != PG_SQL_ASCII)
+	{
+		Oid			utf8_to_server_proc;
+
+		Assert(IsTransactionState());
+		utf8_to_server_proc =
+			FindDefaultConversionProc(PG_UTF8,
+									  current_server_encoding);
+		/* If there's no such conversion, just leave the pointer as NULL */
+		if (OidIsValid(utf8_to_server_proc))
+		{
+			FmgrInfo   *finfo;
+
+			finfo = (FmgrInfo *) MemoryContextAlloc(TopMemoryContext,
+													sizeof(FmgrInfo));
+			fmgr_info_cxt(utf8_to_server_proc, finfo,
+						  TopMemoryContext);
+			/* Set Utf8ToServerConvProc only after data is fully valid */
+			Utf8ToServerConvProc = finfo;
+		}
+	}
 }
 
 /*
@@ -752,6 +790,73 @@ perform_default_encoding_conversion(const char *src, int len,
 	return result;
 }
 
+/*
+ * Convert a single Unicode code point into a string in the server encoding.
+ *
+ * The code point given by "c" is converted and stored at *s, which must
+ * have at least MAX_UNICODE_EQUIVALENT_STRING+1 bytes available.
+ * The output will have a trailing '\0'.  Throws error if the conversion
+ * cannot be performed.
+ *
+ * Note that this relies on having previously looked up any required
+ * conversion function.  That's partly for speed but mostly because the parser
+ * may call this outside any transaction, or in an aborted transaction.
+ */
+void
+pg_unicode_to_server(pg_wchar c, unsigned char *s)
+{
+	unsigned char c_as_utf8[MAX_MULTIBYTE_CHAR_LEN + 1];
+	int			c_as_utf8_len;
+	int			server_encoding;
+
+	/*
+	 * Complain if invalid Unicode code point.  The choice of errcode here is
+	 * debatable, but really our caller should have checked this anyway.
+	 */
+	if (c == 0 || c > 0x10FFFF)
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode code point")));
+
+	/* Otherwise, if it's in ASCII range, conversion is trivial */
+	if (c <= 0x7F)
+	{
+		s[0] = (unsigned char) c;
+		s[1] = '\0';
+		return;
+	}
+
+	/* If the server encoding is UTF-8, we just need to reformat the code */
+	server_encoding = GetDatabaseEncoding();
+	if (server_encoding == PG_UTF8)
+	{
+		unicode_to_utf8(c, s);
+		s[pg_utf_mblen(s)] = '\0';
+		return;
+	}
+
+	/* For all other cases, we must have a conversion function available */
+	if (Utf8ToServerConvProc == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("conversion between %s and %s is not supported",
+						pg_enc2name_tbl[PG_UTF8].name,
+						GetDatabaseEncodingName())));
+
+	/* Construct UTF-8 source string */
+	unicode_to_utf8(c, c_as_utf8);
+	c_as_utf8_len = pg_utf_mblen(c_as_utf8);
+	c_as_utf8[c_as_utf8_len] = '\0';
+
+	/* Convert, or throw error if we can't */
+	FunctionCall5(Utf8ToServerConvProc,
+				  Int32GetDatum(PG_UTF8),
+				  Int32GetDatum(server_encoding),
+				  CStringGetDatum(c_as_utf8),
+				  CStringGetDatum(s),
+				  Int32GetDatum(c_as_utf8_len));
+}
+
 
 /* convert a multibyte string to a wchar */
 int
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 7fb5fa4..2daf301 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -316,6 +316,15 @@ typedef enum pg_enc
 #define MAX_CONVERSION_GROWTH  4
 
 /*
+ * Maximum byte length of the string equivalent to any one Unicode code point,
+ * in any backend encoding.  The current value assumes that a 4-byte UTF-8
+ * character might expand by MAX_CONVERSION_GROWTH, which is a huge
+ * overestimate.  But in current usage we don't allocate large multiples of
+ * this, so there's little point in being stingy.
+ */
+#define MAX_UNICODE_EQUIVALENT_STRING	16
+
+/*
  * Table for mapping an encoding number to official encoding name and
  * possibly other subsidiary data.  Be careful to check encoding number
  * before accessing a table entry!
@@ -602,6 +611,8 @@ extern char *pg_server_to_client(const char *s, int len);
 extern char *pg_any_to_server(const char *s, int len, int encoding);
 extern char *pg_server_to_any(const char *s, int len, int encoding);
 
+extern void pg_unicode_to_server(pg_wchar c, unsigned char *s);
+
 extern unsigned short BIG5toCNS(unsigned short big5, unsigned char *lc);
 extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);
