Re: Pre-proposal: unicode normalized text

Jeff Davis Thu, 29 Feb 2024 17:03:12 -0800

On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.


Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.
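
As a rough illustration of what that enforcement amounts to, here is a
standalone sketch that walks a UTF-8 string one code point at a time and
reports the first one not found in an "assigned" table. It is not the
code from the patch: the toy_* names and the three-range table are
hypothetical stand-ins for the real Unicode category data, and the input
is assumed to be valid UTF-8 already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny "assigned" table; NOT real Unicode data. */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
} toy_range;

static const toy_range toy_assigned[] = {
	{0x0000, 0x007F},			/* Basic Latin */
	{0x0391, 0x03A1},			/* Greek capitals Alpha..Rho */
	{0x03A3, 0x03A9},			/* Sigma..Omega; U+03A2 is unassigned */
};

static bool
toy_is_assigned(uint32_t cp)
{
	size_t		n = sizeof(toy_assigned) / sizeof(toy_assigned[0]);

	for (size_t i = 0; i < n; i++)
		if (cp >= toy_assigned[i].first && cp <= toy_assigned[i].last)
			return true;
	return false;
}

/* Decode one UTF-8 sequence at p (assumed valid); returns its byte length. */
static int
toy_decode_utf8(const unsigned char *p, uint32_t *cp)
{
	if (p[0] < 0x80)
	{
		*cp = p[0];
		return 1;
	}
	if ((p[0] & 0xE0) == 0xC0)
	{
		*cp = ((uint32_t) (p[0] & 0x1F) << 6) | (p[1] & 0x3F);
		return 2;
	}
	if ((p[0] & 0xF0) == 0xE0)
	{
		*cp = ((uint32_t) (p[0] & 0x0F) << 12) |
			((uint32_t) (p[1] & 0x3F) << 6) | (p[2] & 0x3F);
		return 3;
	}
	*cp = ((uint32_t) (p[0] & 0x07) << 18) |
		((uint32_t) (p[1] & 0x3F) << 12) |
		((uint32_t) (p[2] & 0x3F) << 6) | (p[3] & 0x3F);
	return 4;
}

/* True if every code point is in the toy table; else sets *bad, returns false. */
static bool
toy_check_strict(const char *s, size_t len, uint32_t *bad)
{
	const unsigned char *p = (const unsigned char *) s;
	const unsigned char *end = p + len;

	assert(s != NULL && bad != NULL);
	while (p < end)
	{
		uint32_t	cp;
		int			n = toy_decode_utf8(p, &cp);

		if (!toy_is_assigned(cp))
		{
			*bad = cp;
			return false;
		}
		p += n;
	}
	return true;
}
```

With this table, the bytes CE 91 CE A3 (Alpha, Sigma) pass, while
CE 91 CE A2 fails and reports U+03A2, which Unicode leaves unassigned.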

Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though assigned code points can change, too, so limiting
yourself to assigned code points is only a mitigation.)

I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.

One oddity is that the set of assigned code points as Postgres sees it
may not match what a collation provider sees, because the two can be
based on different Unicode versions. That's not great -- perhaps we
could check that code points are considered assigned by *both* Postgres
and ICU. I don't know if there's a way to tell whether libc considers a
code point to be assigned.

Regards,
        Jeff Davis

[1]
https://www.unicode.org/policies/stability_policy.html#Normalization

From 54a15ee4ac5d5f437f4d536d724e1fa9e535fd50 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Thu, 29 Feb 2024 13:13:58 -0800
Subject: [PATCH v1] CREATE DATABASE ... STRICT_UNICODE.

Introduce new per-database option STRICT_UNICODE, which causes
Postgres to reject any textual value containing unassigned code
points. (Surrogate halves were already rejected because they are
invalid for UTF-8.)

"Unassigned" means unassigned as of the version of Unicode that
Postgres is based on; that is, the version returned by the SQL
function unicode_version().

Rejecting unassigned code points helps stabilize the database against
semantic changes across Postgres versions that result from the
assignment of previously-unassigned code points. For instance, Unicode
normalization is only stable across Unicode versions when using
assigned code points.

New databases may use STRICT_UNICODE if the template also uses
STRICT_UNICODE, or if the template is template0. An existing database
may be altered to disable STRICT_UNICODE (and therefore allow
unassigned code points), but may not be altered to enable
STRICT_UNICODE (because existing values may contain unassigned code
points).
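
The rules above can be restated as two small predicates. This is only a
sketch for the reader: the toy_* function names and their parameters
are hypothetical and do not appear in the patch, which enforces the
same conditions inline in createdb() and AlterDatabase().

```c
#include <stdbool.h>

/*
 * CREATE DATABASE: is the requested setting acceptable?
 * Strict mode needs UTF8, and a template that is either strict itself
 * or template0.
 */
static bool
toy_create_allowed(bool want_strict, bool encoding_is_utf8,
				   bool template_strict, bool template_is_template0)
{
	if (!want_strict)
		return true;
	if (!encoding_is_utf8)
		return false;
	return template_strict || template_is_template0;
}

/*
 * ALTER DATABASE: strict mode may be turned off, but never turned on,
 * because existing values might contain unassigned code points.
 */
static bool
toy_alter_allowed(bool old_strict, bool new_strict)
{
	return !(new_strict && !old_strict);
}
```

For example, toy_alter_allowed(true, false) holds while
toy_alter_allowed(false, true) does not.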

Discussion: https://postgr.es/m/f30b58657ceb71d5be032decf4058d454cc1df74.camel%40j-davis.com
---
 doc/src/sgml/ref/alter_database.sgml  | 33 ++++++++++++++
 doc/src/sgml/ref/create_database.sgml | 23 ++++++++++
 doc/src/sgml/ref/createdb.sgml        | 23 ++++++++++
 doc/src/sgml/ref/initdb.sgml          | 23 ++++++++++
 src/backend/commands/dbcommands.c     | 64 ++++++++++++++++++++++++---
 src/backend/utils/adt/oracle_compat.c | 16 +++++++
 src/backend/utils/adt/pg_locale.c     |  3 ++
 src/backend/utils/adt/varlena.c       | 35 +++++++++++++++
 src/backend/utils/init/postinit.c     |  2 +
 src/bin/initdb/initdb.c               | 21 +++++++++
 src/bin/pg_dump/pg_dump.c             | 12 +++++
 src/bin/psql/describe.c               | 11 +++++
 src/bin/scripts/createdb.c            | 15 +++++++
 src/include/catalog/pg_database.dat   |  1 +
 src/include/catalog/pg_database.h     |  3 ++
 src/include/utils/pg_locale.h         |  3 ++
 16 files changed, 281 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/alter_database.sgml b/doc/src/sgml/ref/alter_database.sgml
index 2479c41e8d..07e42dbdd4 100644
--- a/doc/src/sgml/ref/alter_database.sgml
+++ b/doc/src/sgml/ref/alter_database.sgml
@@ -25,6 +25,7 @@ ALTER DATABASE <replaceable class="parameter">name</replaceable> [ [ WITH ] <rep
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
 
+    STRICT_UNICODE <replaceable class="parameter">strict_unicode</replaceable>
     ALLOW_CONNECTIONS <replaceable class="parameter">allowconn</replaceable>
     CONNECTION LIMIT <replaceable class="parameter">connlimit</replaceable>
     IS_TEMPLATE <replaceable class="parameter">istemplate</replaceable>
@@ -112,6 +113,38 @@ ALTER DATABASE <replaceable class="parameter">name</replaceable> RESET ALL
       </listitem>
      </varlistentry>
 
+      <varlistentry>
+       <term><replaceable class="parameter">strict_unicode</replaceable></term>
+       <listitem>
+        <para>
+         If <literal>true</literal>, specifies that the database will
+         reject Unicode code points that are unassigned as of the version of
+         Unicode returned by <function>unicode_version()</function> (See <xref
+         linkend="functions-version"/>). Only valid if the encoding is
+         <literal>UTF8</literal>.
+        </para>
+        <para>
+         This setting may be changed from <literal>true</literal> to
+         <literal>false</literal> to enable storing textual values containing
+         unassigned Unicode code points. However, this setting may not be
+         changed from <literal>false</literal> to <literal>true</literal>,
+         because existing textual values in the database might contain
+         unassigned Unicode code points. A changed setting is recognized in
+         new connections.
+        </para>
+        <note>
+         <para>
+          This option affects all textual fields in the database, and
+          should only be used when the applications control the text
+          input. Furthermore, it may not be possible to use recently-assigned
+          code points if <productname>PostgreSQL</productname> is based on an
+          older version of Unicode that does not yet recognize the new
+          assignments.
+         </para>
+        </note>
+       </listitem>
+      </varlistentry>
+
       <varlistentry>
        <term><replaceable class="parameter">allowconn</replaceable></term>
        <listitem>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 72927960eb..c546789d28 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
     [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ]
            [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ]
            [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ]
+           [ STRICT_UNICODE [=] <replaceable class="parameter">strict_unicode</replaceable> ]
            [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ]
            [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ]
            [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ]
@@ -120,6 +121,28 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        </para>
       </listitem>
      </varlistentry>
+     <varlistentry id="create-database-strict-unicode">
+      <term><replaceable class="parameter">strict_unicode</replaceable></term>
+      <listitem>
+       <para>
+        If <literal>true</literal>, specifies that the new database will
+        reject Unicode code points that are unassigned as of the version of
+        Unicode returned by <function>unicode_version()</function> (See <xref
+        linkend="functions-version"/>). Only valid if the encoding is
+        <literal>UTF8</literal>.
+       </para>
+       <note>
+        <para>
+         This option affects all textual fields in the new database, and
+         should only be used when the applications control the text
+         input. Furthermore, it may not be possible to use recently-assigned
+         code points if <productname>PostgreSQL</productname> is based on an
+         older version of Unicode that does not yet recognize the new
+         assignments.
+        </para>
+       </note>
+      </listitem>
+     </varlistentry>
      <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY">
       <term><replaceable class="parameter">strategy</replaceable></term>
       <listitem>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index e4647d5ce7..d2b8014b59 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -118,6 +118,29 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--strict-unicode</option></term>
+      <listitem>
+       <para>
+        Specifies that the database will reject Unicode code points that are
+        unassigned as of the version of Unicode returned by
+        <function>unicode_version()</function> (See <xref
+        linkend="functions-version"/>). Only valid if the encoding is
+        <literal>UTF8</literal>.
+       </para>
+       <note>
+        <para>
+         This option affects all textual fields in the database, and should
+         only be used when the applications control the text
+         input. Furthermore, it may not be possible to use recently-assigned
+         code points if <productname>PostgreSQL</productname> is based on an
+         older version of Unicode that does not yet recognize the new
+         assignments.
+        </para>
+       </note>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>-l <replaceable class="parameter">locale</replaceable></option></term>
       <term><option>--locale=<replaceable class="parameter">locale</replaceable></option></term>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index cd75cae10e..4242aea278 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -227,6 +227,29 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-strict-unicode">
+      <term><option>--strict-unicode</option></term>
+      <listitem>
+       <para>
+        Specifies that the initial databases will reject Unicode code points
+        that are unassigned as of the version of Unicode returned by
+        <function>unicode_version()</function> (See <xref
+        linkend="functions-version"/>). Only valid if the encoding is
+        <literal>UTF8</literal>.
+       </para>
+       <note>
+        <para>
+         This option affects all textual fields in the initial databases, and
+         should only be used when the applications control the text
+         input. Furthermore, it may not be possible to use recently-assigned
+         code points if <productname>PostgreSQL</productname> is based on an
+         older version of Unicode that does not yet recognize the new
+         assignments.
+        </para>
+       </note>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-allow-group-access" xreflabel="group access">
       <term><option>-g</option></term>
       <term><option>--allow-group-access</option></term>
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index b1327de71e..9524d4447c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -116,7 +116,8 @@ static void movedb(const char *dbname, const char *tblspcname);
 static void movedb_failure_callback(int code, Datum arg);
 static bool get_db_info(const char *name, LOCKMODE lockmode,
 						Oid *dbIdP, Oid *ownerIdP,
-						int *encodingP, bool *dbIsTemplateP, bool *dbAllowConnP, bool *dbHasLoginEvtP,
+						int *encodingP, bool *dbstrictunicodeP, bool *dbIsTemplateP,
+						bool *dbAllowConnP, bool *dbHasLoginEvtP,
 						TransactionId *dbFrozenXidP, MultiXactId *dbMinMultiP,
 						Oid *dbTablespace, char **dbCollate, char **dbCtype, char **dbIculocale,
 						char **dbIcurules,
@@ -673,6 +674,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	Oid			src_dboid;
 	Oid			src_owner;
 	int			src_encoding = -1;
+	bool		src_strictunicode = false;
 	char	   *src_collate = NULL;
 	char	   *src_ctype = NULL;
 	char	   *src_iculocale = NULL;
@@ -697,6 +699,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	DefElem    *downer = NULL;
 	DefElem    *dtemplate = NULL;
 	DefElem    *dencoding = NULL;
+	DefElem	   *dstrictunicode = NULL;
 	DefElem    *dlocale = NULL;
 	DefElem    *dcollate = NULL;
 	DefElem    *dctype = NULL;
@@ -718,6 +721,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	char		dblocprovider = '\0';
 	char	   *canonname;
 	int			encoding = -1;
+	bool		dbstrictunicode = false;
 	bool		dbistemplate = false;
 	bool		dballowconnections = true;
 	int			dbconnlimit = DATCONNLIMIT_UNLIMITED;
@@ -756,6 +760,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 				errorConflictingDefElem(defel, pstate);
 			dencoding = defel;
 		}
+		else if (strcmp(defel->defname, "strict_unicode") == 0)
+		{
+			if (dstrictunicode)
+				errorConflictingDefElem(defel, pstate);
+			dstrictunicode = defel;
+		}
 		else if (strcmp(defel->defname, "locale") == 0)
 		{
 			if (dlocale)
@@ -893,6 +903,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 						 parser_errposition(pstate, dencoding->location)));
 		}
 	}
+	if (dstrictunicode)
+		dbstrictunicode = defGetBoolean(dstrictunicode);
 	if (dlocale && dlocale->arg)
 	{
 		dbcollate = defGetString(dlocale);
@@ -968,7 +980,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		dbtemplate = "template1";	/* Default template database name */
 
 	if (!get_db_info(dbtemplate, ShareLock,
-					 &src_dboid, &src_owner, &src_encoding,
+					 &src_dboid, &src_owner, &src_encoding, &src_strictunicode,
 					 &src_istemplate, &src_allowconn, &src_hasloginevt,
 					 &src_frozenxid, &src_minmxid, &src_deftablespace,
 					 &src_collate, &src_ctype, &src_iculocale, &src_icurules, &src_locprovider,
@@ -1021,6 +1033,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	/* If encoding or locales are defaulted, use source's setting */
 	if (encoding < 0)
 		encoding = src_encoding;
+	if (!dstrictunicode)
+		dbstrictunicode = src_strictunicode;
 	if (dbcollate == NULL)
 		dbcollate = src_collate;
 	if (dbctype == NULL)
@@ -1057,6 +1071,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 				 errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
 	dbctype = canonname;
 
+	if (dbstrictunicode && encoding != PG_UTF8)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("encoding \"%s\" does not support STRICT_UNICODE",
+						pg_encoding_to_char(encoding))));
+
 	check_encoding_locale_matches(encoding, dbcollate, dbctype);
 
 	if (dblocprovider == COLLPROVIDER_ICU)
@@ -1131,6 +1151,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							pg_encoding_to_char(src_encoding)),
 					 errhint("Use the same encoding as in the template database, or use template0 as template.")));
 
+		if (dbstrictunicode && !src_strictunicode)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("STRICT_UNICODE is incompatible with the template database"),
+					 errhint("Use a template database with STRICT_UNICODE, or use template0 as template.")));
+
 		if (strcmp(dbcollate, src_collate) != 0)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1373,6 +1399,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		DirectFunctionCall1(namein, CStringGetDatum(dbname));
 	new_record[Anum_pg_database_datdba - 1] = ObjectIdGetDatum(datdba);
 	new_record[Anum_pg_database_encoding - 1] = Int32GetDatum(encoding);
+	new_record[Anum_pg_database_datstrictunicode - 1] = BoolGetDatum(dbstrictunicode);
 	new_record[Anum_pg_database_datlocprovider - 1] = CharGetDatum(dblocprovider);
 	new_record[Anum_pg_database_datistemplate - 1] = BoolGetDatum(dbistemplate);
 	new_record[Anum_pg_database_datallowconn - 1] = BoolGetDatum(dballowconnections);
@@ -1604,7 +1631,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	pgdbrel = table_open(DatabaseRelationId, RowExclusiveLock);
 
-	if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL,
+	if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL, NULL,
 					 &db_istemplate, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL))
 	{
 		if (!missing_ok)
@@ -1819,7 +1846,7 @@ RenameDatabase(const char *oldname, const char *newname)
 	 */
 	rel = table_open(DatabaseRelationId, RowExclusiveLock);
 
-	if (!get_db_info(oldname, AccessExclusiveLock, &db_id, NULL, NULL, NULL,
+	if (!get_db_info(oldname, AccessExclusiveLock, &db_id, NULL, NULL, NULL, NULL,
 					 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_DATABASE),
@@ -1929,7 +1956,7 @@ movedb(const char *dbname, const char *tblspcname)
 	 */
 	pgdbrel = table_open(DatabaseRelationId, RowExclusiveLock);
 
-	if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL, NULL,
+	if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL, NULL, NULL,
 					 NULL, NULL, NULL, NULL, &src_tblspcoid, NULL, NULL, NULL, NULL, NULL, NULL))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_DATABASE),
@@ -2274,9 +2301,11 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
 	ScanKeyData scankey;
 	SysScanDesc scan;
 	ListCell   *option;
+	bool		dbstrictunicode = false;
 	bool		dbistemplate = false;
 	bool		dballowconnections = true;
 	int			dbconnlimit = DATCONNLIMIT_UNLIMITED;
+	DefElem    *dstrictunicode = NULL;
 	DefElem    *distemplate = NULL;
 	DefElem    *dallowconnections = NULL;
 	DefElem    *dconnlimit = NULL;
@@ -2290,7 +2319,13 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
 	{
 		DefElem    *defel = (DefElem *) lfirst(option);
 
-		if (strcmp(defel->defname, "is_template") == 0)
+		if (strcmp(defel->defname, "strict_unicode") == 0)
+		{
+			if (dstrictunicode)
+				errorConflictingDefElem(defel, pstate);
+			dstrictunicode = defel;
+		}
+		else if (strcmp(defel->defname, "is_template") == 0)
 		{
 			if (distemplate)
 				errorConflictingDefElem(defel, pstate);
@@ -2340,6 +2375,8 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
 		return InvalidOid;
 	}
 
+	if (dstrictunicode && dstrictunicode->arg)
+		dbstrictunicode = defGetBoolean(dstrictunicode);
 	if (distemplate && distemplate->arg)
 		dbistemplate = defGetBoolean(distemplate);
 	if (dallowconnections && dallowconnections->arg)
@@ -2400,6 +2437,15 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
 	/*
 	 * Build an updated tuple, perusing the information just obtained
 	 */
+	if (dstrictunicode)
+	{
+		if (dbstrictunicode && !datform->datstrictunicode)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("STRICT_UNICODE cannot be enabled on an existing database")));
+		new_record[Anum_pg_database_datstrictunicode - 1] = BoolGetDatum(dbstrictunicode);
+		new_record_repl[Anum_pg_database_datstrictunicode - 1] = true;
+	}
 	if (distemplate)
 	{
 		new_record[Anum_pg_database_datistemplate - 1] = BoolGetDatum(dbistemplate);
@@ -2695,7 +2741,8 @@ pg_database_collation_actual_version(PG_FUNCTION_ARGS)
 static bool
 get_db_info(const char *name, LOCKMODE lockmode,
 			Oid *dbIdP, Oid *ownerIdP,
-			int *encodingP, bool *dbIsTemplateP, bool *dbAllowConnP, bool *dbHasLoginEvtP,
+			int *encodingP, bool *strictunicodeP, bool *dbIsTemplateP,
+			bool *dbAllowConnP, bool *dbHasLoginEvtP,
 			TransactionId *dbFrozenXidP, MultiXactId *dbMinMultiP,
 			Oid *dbTablespace, char **dbCollate, char **dbCtype, char **dbIculocale,
 			char **dbIcurules,
@@ -2777,6 +2824,9 @@ get_db_info(const char *name, LOCKMODE lockmode,
 				/* character encoding */
 				if (encodingP)
 					*encodingP = dbform->encoding;
+				/* reject unassigned code points? (UTF-8 only) */
+				if (strictunicodeP)
+					*strictunicodeP = dbform->datstrictunicode;
 				/* allowed as template? */
 				if (dbIsTemplateP)
 					*dbIsTemplateP = dbform->datistemplate;
diff --git a/src/backend/utils/adt/oracle_compat.c b/src/backend/utils/adt/oracle_compat.c
index b126a7d460..d7061f964f 100644
--- a/src/backend/utils/adt/oracle_compat.c
+++ b/src/backend/utils/adt/oracle_compat.c
@@ -16,11 +16,13 @@
 #include "postgres.h"
 
 #include "common/int.h"
+#include "common/unicode_category.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/formatting.h"
 #include "utils/memutils.h"
+#include "utils/pg_locale.h"
 #include "varatt.h"
 
 
@@ -1030,6 +1032,7 @@ chr			(PG_FUNCTION_ARGS)
 		/* for Unicode we treat the argument as a code point */
 		int			bytes;
 		unsigned char *wch;
+		pg_unicode_category category;
 
 		/*
 		 * We only allow valid Unicode code points; per RFC3629 that stops at
@@ -1042,6 +1045,19 @@ chr			(PG_FUNCTION_ARGS)
 					 errmsg("requested character too large for encoding: %u",
 							cvalue)));
 
+		if (database_strict_unicode)
+		{
+			category = unicode_category(cvalue);
+			if (category == PG_U_UNASSIGNED)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("unassigned Unicode code point: %06X", cvalue));
+			else if (category == PG_U_SURROGATE)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("Unicode code point is surrogate: %06X", cvalue));
+		}
+
 		if (cvalue > 0xffff)
 			bytes = 4;
 		else if (cvalue > 0x07ff)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 79b59b0af7..8ac9a35226 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -114,6 +114,9 @@ char	   *localized_full_days[7 + 1];
 char	   *localized_abbrev_months[12 + 1];
 char	   *localized_full_months[12 + 1];
 
+/* reject unassigned code points? (UTF-8 only) */
+bool database_strict_unicode = false;
+
 /* is the databases's LC_CTYPE the C locale? */
 bool		database_ctype_is_c = false;
 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 543afb66e5..e659a54c80 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -138,6 +138,7 @@ static char *text_position_next_internal(char *start_ptr, TextPositionState *sta
 static char *text_position_get_match_ptr(TextPositionState *state);
 static int	text_position_get_match_pos(TextPositionState *state);
 static void text_position_cleanup(TextPositionState *state);
+static void check_strict_unicode(text *input);
 static void check_collation_set(Oid collid);
 static int	text_cmp(text *arg1, text *arg2, Oid collid);
 static bytea *bytea_catenate(bytea *t1, bytea *t2);
@@ -200,6 +201,7 @@ cstring_to_text_with_len(const char *s, int len)
 	SET_VARSIZE(result, len + VARHDRSZ);
 	memcpy(VARDATA(result), s, len);
 
+	check_strict_unicode(result);
 	return result;
 }
 
@@ -609,6 +611,7 @@ textrecv(PG_FUNCTION_ARGS)
 
 	result = cstring_to_text_with_len(str, nbytes);
 	pfree(str);
+
 	PG_RETURN_TEXT_P(result);
 }
 
@@ -6298,6 +6301,38 @@ unicode_assigned(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(true);
 }
 
+static void
+check_strict_unicode(text *input)
+{
+	unsigned char *p;
+	int			size;
+
+	if (!database_strict_unicode)
+		return;
+
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+
+	/* convert to pg_wchar */
+	size = pg_mbstrlen_with_len(VARDATA_ANY(input), VARSIZE_ANY_EXHDR(input));
+	p = (unsigned char *) VARDATA_ANY(input);
+	for (int i = 0; i < size; i++)
+	{
+		pg_wchar	code = utf8_to_unicode(p);
+		int			category = unicode_category(code);
+
+		if (category == PG_U_UNASSIGNED)
+			ereport(ERROR,
+					errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					errmsg("unassigned Unicode code point: %06X", code));
+		else if (category == PG_U_SURROGATE)
+			ereport(ERROR,
+					errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					errmsg("Unicode code point is surrogate: %06X", code));
+
+		p += pg_utf_mblen(p);
+	}
+}
+
 Datum
 unicode_normalize_func(PG_FUNCTION_ARGS)
 {
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 5ffe9bdd98..045e8c07aa 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -401,6 +401,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	SetConfigOption("client_encoding", GetDatabaseEncodingName(),
 					PGC_BACKEND, PGC_S_DYNAMIC_DEFAULT);
 
+	database_strict_unicode = dbform->datstrictunicode;
+
 	/* assign locale variables */
 	datum = SysCacheGetAttrNotNull(DATABASEOID, tup, Anum_pg_database_datcollate);
 	collate = TextDatumGetCString(datum);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ac409b0006..2418a7ba5b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -93,6 +93,13 @@ typedef struct _stringlist
 	struct _stringlist *next;
 } _stringlist;
 
+enum trivalue
+{
+	TRI_DEFAULT,
+	TRI_NO,
+	TRI_YES
+};
+
 static const char *const auth_methods_host[] = {
 	"trust", "reject", "scram-sha-256", "md5", "password", "ident", "radius",
 #ifdef ENABLE_GSS
@@ -149,6 +156,7 @@ static char *icu_locale = NULL;
 static char *icu_rules = NULL;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
+static enum trivalue strict_unicode = TRI_DEFAULT;
 static bool pwprompt = false;
 static char *pwfilename = NULL;
 static char *superuser_password = NULL;
@@ -1509,6 +1517,9 @@ bootstrap_template1(void)
 	bki_lines = replace_token(bki_lines, "ENCODING",
 							  encodingid_to_string(encodingid));
 
+	bki_lines = replace_token(bki_lines, "STRICT_UNICODE",
+							  (strict_unicode == TRI_YES) ? "TRUE" : "FALSE");
+
 	bki_lines = replace_token(bki_lines, "LC_COLLATE",
 							  escape_quotes_bki(lc_collate));
 
@@ -2432,6 +2443,8 @@ usage(const char *progname)
 	printf(_("      --auth-local=METHOD   default authentication method for local-socket connections\n"));
 	printf(_(" [-D, --pgdata=]DATADIR     location for this database cluster\n"));
 	printf(_("  -E, --encoding=ENCODING   set default encoding for new databases\n"));
+	printf(_("      --no-strict-unicode   do not reject unassigned Unicode code points\n"));
+	printf(_("      --strict-unicode      reject unassigned Unicode code points\n"));
 	printf(_("  -g, --allow-group-access  allow group read/execute on data directory\n"));
 	printf(_("      --icu-locale=LOCALE   set ICU locale ID for new databases\n"));
 	printf(_("      --icu-rules=RULES     set additional ICU collation rules for new databases\n"));
@@ -3102,6 +3115,8 @@ main(int argc, char *argv[])
 		{"icu-locale", required_argument, NULL, 16},
 		{"icu-rules", required_argument, NULL, 17},
 		{"sync-method", required_argument, NULL, 18},
+		{"no-strict-unicode", no_argument, NULL, 19},
+		{"strict-unicode", no_argument, NULL, 20},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3286,6 +3301,12 @@ main(int argc, char *argv[])
 				if (!parse_sync_method(optarg, &sync_method))
 					exit(1);
 				break;
+			case 19:
+				strict_unicode = TRI_NO;
+				break;
+			case 20:
+				strict_unicode = TRI_YES;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2225a12718..7b028c0be3 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2981,6 +2981,7 @@ dumpDatabase(Archive *fout)
 				i_datname,
 				i_datdba,
 				i_encoding,
+				i_datstrictunicode,
 				i_datlocprovider,
 				i_collate,
 				i_ctype,
@@ -3000,6 +3001,7 @@ dumpDatabase(Archive *fout)
 	const char *datname,
 			   *dba,
 			   *encoding,
+			   *datstrictunicode,
 			   *datlocprovider,
 			   *collate,
 			   *ctype,
@@ -3035,6 +3037,10 @@ dumpDatabase(Archive *fout)
 		appendPQExpBufferStr(dbQry, "daticurules, ");
 	else
 		appendPQExpBufferStr(dbQry, "NULL AS daticurules, ");
+	if (fout->remoteVersion >= 170000)
+		appendPQExpBufferStr(dbQry, "datstrictunicode, ");
+	else
+		appendPQExpBufferStr(dbQry, "'f' AS datstrictunicode, ");
 	appendPQExpBufferStr(dbQry,
 						 "(SELECT spcname FROM pg_tablespace t WHERE t.oid = dattablespace) AS tablespace, "
 						 "shobj_description(oid, 'pg_database') AS description "
@@ -3048,6 +3054,7 @@ dumpDatabase(Archive *fout)
 	i_datname = PQfnumber(res, "datname");
 	i_datdba = PQfnumber(res, "datdba");
 	i_encoding = PQfnumber(res, "encoding");
+	i_datstrictunicode = PQfnumber(res, "datstrictunicode");
 	i_datlocprovider = PQfnumber(res, "datlocprovider");
 	i_collate = PQfnumber(res, "datcollate");
 	i_ctype = PQfnumber(res, "datctype");
@@ -3067,6 +3074,7 @@ dumpDatabase(Archive *fout)
 	datname = PQgetvalue(res, 0, i_datname);
 	dba = getRoleName(PQgetvalue(res, 0, i_datdba));
 	encoding = PQgetvalue(res, 0, i_encoding);
+	datstrictunicode = PQgetvalue(res, 0, i_datstrictunicode);
 	datlocprovider = PQgetvalue(res, 0, i_datlocprovider);
 	collate = PQgetvalue(res, 0, i_collate);
 	ctype = PQgetvalue(res, 0, i_ctype);
@@ -3111,6 +3119,10 @@ dumpDatabase(Archive *fout)
 		appendStringLiteralAH(creaQry, encoding, fout);
 	}
 
+	if (strcmp(datstrictunicode, "t") == 0)
+		appendPQExpBufferStr(creaQry, " STRICT_UNICODE = TRUE");
+	else
+		appendPQExpBufferStr(creaQry, " STRICT_UNICODE = FALSE");
 	appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = ");
 	if (datlocprovider[0] == 'c')
 		appendPQExpBufferStr(creaQry, "libc");
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index b6a4eb1d56..0873bddbb6 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -953,6 +953,17 @@ listAllDbs(const char *pattern, bool verbose)
 		appendPQExpBuffer(&buf,
 						  "  NULL as \"%s\",\n",
 						  gettext_noop("ICU Rules"));
+	if (verbose)
+	{
+		if (pset.sversion >= 170000)
+			appendPQExpBuffer(&buf,
+							  "  d.datstrictunicode as \"%s\",\n",
+							  gettext_noop("Strict Unicode"));
+		else
+			appendPQExpBuffer(&buf,
+							  "  'f' as \"%s\",\n",
+							  gettext_noop("Strict Unicode"));
+	}
 	appendPQExpBufferStr(&buf, "  ");
 	printACLColumn(&buf, "d.datacl");
 	if (verbose)
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 14970a6a5f..3f8f8d27fb 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -42,6 +42,8 @@ main(int argc, char *argv[])
 		{"locale-provider", required_argument, NULL, 4},
 		{"icu-locale", required_argument, NULL, 5},
 		{"icu-rules", required_argument, NULL, 6},
+		{"no-strict-unicode", no_argument, NULL, 7},
+		{"strict-unicode", no_argument, NULL, 8},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -55,6 +57,7 @@ main(int argc, char *argv[])
 	char	   *host = NULL;
 	char	   *port = NULL;
 	char	   *username = NULL;
+	enum trivalue strict_unicode = TRI_DEFAULT;
 	enum trivalue prompt_password = TRI_DEFAULT;
 	ConnParams	cparams;
 	bool		echo = false;
@@ -139,6 +142,12 @@ main(int argc, char *argv[])
 			case 6:
 				icu_rules = pg_strdup(optarg);
 				break;
+			case 7:
+				strict_unicode = TRI_NO;
+				break;
+			case 8:
+				strict_unicode = TRI_YES;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -207,6 +216,12 @@ main(int argc, char *argv[])
 		appendPQExpBufferStr(&sql, " ENCODING ");
 		appendStringLiteralConn(&sql, encoding, conn);
 	}
+	if (strict_unicode != TRI_DEFAULT)
+	{
+		const char *val = (strict_unicode == TRI_YES) ? "TRUE" : "FALSE";
+		appendPQExpBufferStr(&sql, " STRICT_UNICODE ");
+		appendStringLiteralConn(&sql, val, conn);
+	}
 	if (strategy)
 		appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
 	if (template)
diff --git a/src/include/catalog/pg_database.dat b/src/include/catalog/pg_database.dat
index 4306e8a3e8..330f11133d 100644
--- a/src/include/catalog/pg_database.dat
+++ b/src/include/catalog/pg_database.dat
@@ -15,6 +15,7 @@
 { oid => '1', oid_symbol => 'Template1DbOid',
   descr => 'default template for new databases',
   datname => 'template1', encoding => 'ENCODING',
+  datstrictunicode => 'STRICT_UNICODE',
   datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't',
   datallowconn => 't', dathasloginevt => 'f', datconnlimit => '-1', datfrozenxid => '0',
   datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE',
diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index 014baa7bab..21b512818b 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -52,6 +52,9 @@ CATALOG(pg_database,1262,DatabaseRelationId) BKI_SHARED_RELATION BKI_ROWTYPE_OID
 	/* database has login event triggers? */
 	bool		dathasloginevt;
 
+	/* reject unassigned code points? (UTF-8 only) */
+	bool		datstrictunicode BKI_DEFAULT(false);
+
 	/*
 	 * Max connections allowed. Negative values have special meaning, see
 	 * DATCONNLIMIT_* defines below.
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 28c925b5af..f48853b98f 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -48,6 +48,9 @@ extern PGDLLIMPORT char *localized_full_days[];
 extern PGDLLIMPORT char *localized_abbrev_months[];
 extern PGDLLIMPORT char *localized_full_months[];
 
+/* reject unassigned code points? (UTF-8 only) */
+extern PGDLLIMPORT bool database_strict_unicode;
+
 /* is the databases's LC_CTYPE the C locale? */
 extern PGDLLIMPORT bool database_ctype_is_c;
 
-- 
2.34.1
