On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.

Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though assigned code points can change, too, so limiting
yourself to assigned code points is only a mitigation.)

I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.

There's some weirdness that the set of assigned code points as Postgres
sees it may not match what a collation provider sees due to differing
Unicode versions. That's not great -- perhaps we could check that code
points are considered assigned by *both* Postgres and ICU. I don't know
if there's a way to tell if libc considers a code point to be assigned.

Regards,
	Jeff Davis

[1]
https://www.unicode.org/policies/stability_policy.html#Normalization

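For illustration only (this is not code from the attached patch), the per-value check discussed above can be sketched as a small standalone C routine: decode each UTF-8 sequence and reject any code point that falls outside a table of known ranges. All names here (`cp_range`, `toy_assigned`, `decode_utf8`, `toy_is_assigned`, `strict_unicode_ok`) are hypothetical, and the three-entry table is a deliberately incomplete stand-in for a real category lookup tied to a specific Unicode version, so it treats many genuinely assigned code points as unknown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy table: NOT a real list of assigned code points. */
typedef struct { uint32_t first, last; } cp_range;

static const cp_range toy_assigned[] = {
    {0x0000, 0x007F},   /* Basic Latin */
    {0x00A0, 0x00FF},   /* printable part of Latin-1 Supplement */
    {0x0391, 0x03A1},   /* Greek capital letters Alpha..Rho */
};

/*
 * Decode one UTF-8 sequence from s (len bytes available).
 * Returns the number of bytes consumed, or 0 if the sequence is
 * truncated or has a bad lead/continuation byte. Overlong forms are
 * not rejected; this sketch assumes the input was validated earlier.
 */
static size_t decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { n = 2; *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; *cp = s[0] & 0x07; }
    else
        return 0;
    if (len < n)
        return 0;
    for (size_t i = 1; i < n; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* True if cp falls inside one of the toy ranges above. */
static bool toy_is_assigned(uint32_t cp)
{
    for (size_t i = 0; i < sizeof(toy_assigned) / sizeof(toy_assigned[0]); i++)
        if (cp >= toy_assigned[i].first && cp <= toy_assigned[i].last)
            return true;
    return false;
}

/*
 * Returns true if every code point in str[0..len) is in the toy table.
 * On failure, *bad holds the offending code point, or UINT32_MAX if
 * the input could not be decoded.
 */
bool strict_unicode_ok(const char *str, size_t len, uint32_t *bad)
{
    const unsigned char *p = (const unsigned char *) str;
    size_t off = 0;

    assert(str != NULL && bad != NULL);
    while (off < len)
    {
        uint32_t cp = 0;
        size_t n = decode_utf8(p + off, len - off, &cp);

        if (n == 0) { *bad = UINT32_MAX; return false; }
        if (!toy_is_assigned(cp)) { *bad = cp; return false; }
        off += n;
    }
    return true;
}
```

With this toy table, `strict_unicode_ok("abc", 3, &bad)` succeeds, while a string containing a code point outside the table, such as U+0378 encoded as `"\xCD\xB8"`, fails and reports the code point through `bad`. A real implementation would consult a complete table generated from the Unicode Character Database rather than a hand-written list.
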
From 54a15ee4ac5d5f437f4d536d724e1fa9e535fd50 Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Thu, 29 Feb 2024 13:13:58 -0800
Subject: [PATCH v1] CREATE DATABASE ... STRICT_UNICODE.

Introduce new per-database option STRICT_UNICODE, which causes
Postgres to reject any textual value containing unassigned code
points. (Surrogate halves were already rejected because they are
invalid for UTF-8.)

"Unassigned" means unassigned as of the version of Unicode that
Postgres is based on; that is, the version returned by the SQL
function unicode_version().

By rejecting unassigned code points, it helps stabilize the database
against semantic changes across Postgres versions resulting from
assignment of previously-unassigned code points. For instance, Unicode
normalization is only stable across Unicode versions when using
assigned code points.

New databases may use STRICT_UNICODE if the template also uses
STRICT_UNICODE, or if the template is template0. An existing database
may be altered to disable STRICT_UNICODE (and therefore allow
unassigned code points), but may not be altered to enable
STRICT_UNICODE (because existing values may contain unassigned code
points).

Discussion: https://postgr.es/m/f30b58657ceb71d5be032decf4058d454cc1df74.camel%40j-davis.com --- doc/src/sgml/ref/alter_database.sgml | 33 ++++++++++++++ doc/src/sgml/ref/create_database.sgml | 23 ++++++++++ doc/src/sgml/ref/createdb.sgml | 23 ++++++++++ doc/src/sgml/ref/initdb.sgml | 23 ++++++++++ src/backend/commands/dbcommands.c | 64 ++++++++++++++++++++++++--- src/backend/utils/adt/oracle_compat.c | 16 +++++++ src/backend/utils/adt/pg_locale.c | 3 ++ src/backend/utils/adt/varlena.c | 35 +++++++++++++++ src/backend/utils/init/postinit.c | 2 + src/bin/initdb/initdb.c | 21 +++++++++ src/bin/pg_dump/pg_dump.c | 12 +++++ src/bin/psql/describe.c | 11 +++++ src/bin/scripts/createdb.c | 15 +++++++ src/include/catalog/pg_database.dat | 1 + src/include/catalog/pg_database.h | 3 ++ src/include/utils/pg_locale.h | 3 ++ 16 files changed, 281 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/ref/alter_database.sgml b/doc/src/sgml/ref/alter_database.sgml index 2479c41e8d..07e42dbdd4 100644 --- a/doc/src/sgml/ref/alter_database.sgml +++ b/doc/src/sgml/ref/alter_database.sgml @@ -25,6 +25,7 @@ ALTER DATABASE <replaceable class="parameter">name</replaceable> [ [ WITH ] <rep <phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase> + STRICT_UNICODE <replaceable class="parameter">strict_unicode</replaceable> ALLOW_CONNECTIONS <replaceable class="parameter">allowconn</replaceable> CONNECTION LIMIT <replaceable class="parameter">connlimit</replaceable> IS_TEMPLATE <replaceable class="parameter">istemplate</replaceable> @@ -112,6 +113,38 @@ ALTER DATABASE <replaceable class="parameter">name</replaceable> RESET ALL </listitem> </varlistentry> + <varlistentry> + <term><replaceable class="parameter">strict_unicode</replaceable></term> + <listitem> + <para> + If <literal>true</literal>, specifies that the initial databases will + reject Unicode code points that are unassigned as of the version of + Unicode returned by <function>unicode_version()</function> 
(See <xref + linkend="functions-version"/>). Only valid if the encoding is + <literal>UTF8</literal>. + </para> + <para> + This setting may be changed from <literal>true</literal> to + <literal>false</literal> to enable storing textual values containing + unassigned Unicode code points. However, this setting may not be + changed from <literal>false</literal> to <literal>true</literal>, + because existing textual values in the database might contain + unassigned Unicode code points. A changed setting is recognized in + new connections. + </para> + <note> + <para> + This option affects all textual fields in the initial databases, and + should only be used when the applications control the text + input. Furthermore, it may not be possible to use recently-assigned + code points if <productname>PostgreSQL</productname> is based on an + older version of Unicode that does not yet recognize the new + assignments. + </para> + </note> + </listitem> + </varlistentry> + <varlistentry> <term><replaceable class="parameter">allowconn</replaceable></term> <listitem> diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml index 72927960eb..c546789d28 100644 --- a/doc/src/sgml/ref/create_database.sgml +++ b/doc/src/sgml/ref/create_database.sgml @@ -25,6 +25,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable> [ WITH ] [ OWNER [=] <replaceable class="parameter">user_name</replaceable> ] [ TEMPLATE [=] <replaceable class="parameter">template</replaceable> ] [ ENCODING [=] <replaceable class="parameter">encoding</replaceable> ] + [ STRICT_UNICODE [=] <replaceable class="parameter">strict_unicode</replaceable> ] [ STRATEGY [=] <replaceable class="parameter">strategy</replaceable> ] [ LOCALE [=] <replaceable class="parameter">locale</replaceable> ] [ LC_COLLATE [=] <replaceable class="parameter">lc_collate</replaceable> ] @@ -120,6 +121,28 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable> </para> </listitem> 
</varlistentry> + <varlistentry id="create-database-strict-unicode"> + <term><replaceable class="parameter">strict_unicode</replaceable></term> + <listitem> + <para> + If <literal>true</literal>, specifies that the initial databases will + reject Unicode code points that are unassigned as of the version of + Unicode returned by <function>unicode_version()</function> (See <xref + linkend="functions-version"/>). Only valid if the encoding is + <literal>UTF8</literal>. + </para> + <note> + <para> + This option affects all textual fields in the initial databases, and + should only be used when the applications control the text + input. Furthermore, it may not be possible to use recently-assigned + code points if <productname>PostgreSQL</productname> is based on an + older version of Unicode that does not yet recognize the new + assignments. + </para> + </note> + </listitem> + </varlistentry> <varlistentry id="create-database-strategy" xreflabel="CREATE DATABASE STRATEGY"> <term><replaceable class="parameter">strategy</replaceable></term> <listitem> diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml index e4647d5ce7..d2b8014b59 100644 --- a/doc/src/sgml/ref/createdb.sgml +++ b/doc/src/sgml/ref/createdb.sgml @@ -118,6 +118,29 @@ PostgreSQL documentation </listitem> </varlistentry> + <varlistentry> + <term><option>--strict-unicode</option></term> + <listitem> + <para> + Specifies that the database will reject Unicode code points that are + unassigned as of the version of Unicode returned by + <function>unicode_version()</function> (See <xref + linkend="functions-version"/>). Only valid if the encoding is + <literal>UTF8</literal>. + </para> + <note> + <para> + This option affects all textual fields in the database, and should + only be used when the applications control the text + input. 
Furthermore, it may not be possible to use recently-assigned + code points if <productname>PostgreSQL</productname> is based on an + older version of Unicode that does not yet recognize the new + assignments. + </para> + </note> + </listitem> + </varlistentry> + <varlistentry> <term><option>-l <replaceable class="parameter">locale</replaceable></option></term> <term><option>--locale=<replaceable class="parameter">locale</replaceable></option></term> diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml index cd75cae10e..4242aea278 100644 --- a/doc/src/sgml/ref/initdb.sgml +++ b/doc/src/sgml/ref/initdb.sgml @@ -227,6 +227,29 @@ PostgreSQL documentation </listitem> </varlistentry> + <varlistentry id="app-initdb-option-strict-unicode"> + <term><option>--strict-unicode</option></term> + <listitem> + <para> + Specifies that the initial databases will reject Unicode code points + that are unassigned as of the version of Unicode returned by + <function>unicode_version()</function> (See <xref + linkend="functions-version"/>). Only valid if the encoding is + <literal>UTF8</literal>. + </para> + <note> + <para> + This option affects all textual fields in the initial databases, and + should only be used when the applications control the text + input. Furthermore, it may not be possible to use recently-assigned + code points if <productname>PostgreSQL</productname> is based on an + older version of Unicode that does not yet recognize the new + assignments. 
+ </para> + </note> + </listitem> + </varlistentry> + <varlistentry id="app-initdb-allow-group-access" xreflabel="group access"> <term><option>-g</option></term> <term><option>--allow-group-access</option></term> diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index b1327de71e..9524d4447c 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -116,7 +116,8 @@ static void movedb(const char *dbname, const char *tblspcname); static void movedb_failure_callback(int code, Datum arg); static bool get_db_info(const char *name, LOCKMODE lockmode, Oid *dbIdP, Oid *ownerIdP, - int *encodingP, bool *dbIsTemplateP, bool *dbAllowConnP, bool *dbHasLoginEvtP, + int *encodingP, bool *dbstrictunicodeP, bool *dbIsTemplateP, + bool *dbAllowConnP, bool *dbHasLoginEvtP, TransactionId *dbFrozenXidP, MultiXactId *dbMinMultiP, Oid *dbTablespace, char **dbCollate, char **dbCtype, char **dbIculocale, char **dbIcurules, @@ -673,6 +674,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) Oid src_dboid; Oid src_owner; int src_encoding = -1; + bool src_strictunicode = false; char *src_collate = NULL; char *src_ctype = NULL; char *src_iculocale = NULL; @@ -697,6 +699,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) DefElem *downer = NULL; DefElem *dtemplate = NULL; DefElem *dencoding = NULL; + DefElem *dstrictunicode = NULL; DefElem *dlocale = NULL; DefElem *dcollate = NULL; DefElem *dctype = NULL; @@ -718,6 +721,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) char dblocprovider = '\0'; char *canonname; int encoding = -1; + bool dbstrictunicode = false; bool dbistemplate = false; bool dballowconnections = true; int dbconnlimit = DATCONNLIMIT_UNLIMITED; @@ -756,6 +760,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) errorConflictingDefElem(defel, pstate); dencoding = defel; } + else if (strcmp(defel->defname, "strict_unicode") == 0) + { + if (dstrictunicode) + 
errorConflictingDefElem(defel, pstate); + dstrictunicode = defel; + } else if (strcmp(defel->defname, "locale") == 0) { if (dlocale) @@ -893,6 +903,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) parser_errposition(pstate, dencoding->location))); } } + if (dstrictunicode) + dbstrictunicode = defGetBoolean(dstrictunicode); if (dlocale && dlocale->arg) { dbcollate = defGetString(dlocale); @@ -968,7 +980,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) dbtemplate = "template1"; /* Default template database name */ if (!get_db_info(dbtemplate, ShareLock, - &src_dboid, &src_owner, &src_encoding, + &src_dboid, &src_owner, &src_encoding, &src_strictunicode, &src_istemplate, &src_allowconn, &src_hasloginevt, &src_frozenxid, &src_minmxid, &src_deftablespace, &src_collate, &src_ctype, &src_iculocale, &src_icurules, &src_locprovider, @@ -1021,6 +1033,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) /* If encoding or locales are defaulted, use source's setting */ if (encoding < 0) encoding = src_encoding; + if (!dstrictunicode) + dbstrictunicode = src_strictunicode; if (dbcollate == NULL) dbcollate = src_collate; if (dbctype == NULL) @@ -1057,6 +1071,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) errhint("If the locale name is specific to ICU, use ICU_LOCALE."))); dbctype = canonname; + if (dbstrictunicode && encoding != PG_UTF8) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("encoding \"%s\" does not support STRICT_UNICODE", + pg_encoding_to_char(encoding)))); + check_encoding_locale_matches(encoding, dbcollate, dbctype); if (dblocprovider == COLLPROVIDER_ICU) @@ -1131,6 +1151,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) pg_encoding_to_char(src_encoding)), errhint("Use the same encoding as in the template database, or use template0 as template."))); + if (dbstrictunicode && !src_strictunicode) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("STRICT_UNICODE is 
incompatible with the template database"), + errhint("Use a template database with STRICT_UNICODE, or use template0 as template."))); + if (strcmp(dbcollate, src_collate) != 0) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), @@ -1373,6 +1399,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) DirectFunctionCall1(namein, CStringGetDatum(dbname)); new_record[Anum_pg_database_datdba - 1] = ObjectIdGetDatum(datdba); new_record[Anum_pg_database_encoding - 1] = Int32GetDatum(encoding); + new_record[Anum_pg_database_datstrictunicode - 1] = BoolGetDatum(dbstrictunicode); new_record[Anum_pg_database_datlocprovider - 1] = CharGetDatum(dblocprovider); new_record[Anum_pg_database_datistemplate - 1] = BoolGetDatum(dbistemplate); new_record[Anum_pg_database_datallowconn - 1] = BoolGetDatum(dballowconnections); @@ -1604,7 +1631,7 @@ dropdb(const char *dbname, bool missing_ok, bool force) */ pgdbrel = table_open(DatabaseRelationId, RowExclusiveLock); - if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL, + if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL, NULL, &db_istemplate, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)) { if (!missing_ok) @@ -1819,7 +1846,7 @@ RenameDatabase(const char *oldname, const char *newname) */ rel = table_open(DatabaseRelationId, RowExclusiveLock); - if (!get_db_info(oldname, AccessExclusiveLock, &db_id, NULL, NULL, NULL, + if (!get_db_info(oldname, AccessExclusiveLock, &db_id, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)) ereport(ERROR, (errcode(ERRCODE_UNDEFINED_DATABASE), @@ -1929,7 +1956,7 @@ movedb(const char *dbname, const char *tblspcname) */ pgdbrel = table_open(DatabaseRelationId, RowExclusiveLock); - if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL, NULL, + if (!get_db_info(dbname, AccessExclusiveLock, &db_id, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, &src_tblspcoid, NULL, NULL, NULL, NULL, NULL, NULL)) 
ereport(ERROR, (errcode(ERRCODE_UNDEFINED_DATABASE), @@ -2274,9 +2301,11 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel) ScanKeyData scankey; SysScanDesc scan; ListCell *option; + bool dbstrictunicode = false; bool dbistemplate = false; bool dballowconnections = true; int dbconnlimit = DATCONNLIMIT_UNLIMITED; + DefElem *dstrictunicode = NULL; DefElem *distemplate = NULL; DefElem *dallowconnections = NULL; DefElem *dconnlimit = NULL; @@ -2290,7 +2319,13 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel) { DefElem *defel = (DefElem *) lfirst(option); - if (strcmp(defel->defname, "is_template") == 0) + if (strcmp(defel->defname, "strict_unicode") == 0) + { + if (dstrictunicode) + errorConflictingDefElem(defel, pstate); + dstrictunicode = defel; + } + else if (strcmp(defel->defname, "is_template") == 0) { if (distemplate) errorConflictingDefElem(defel, pstate); @@ -2340,6 +2375,8 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel) return InvalidOid; } + if (dstrictunicode && dstrictunicode->arg) + dbstrictunicode = defGetBoolean(dstrictunicode); if (distemplate && distemplate->arg) dbistemplate = defGetBoolean(distemplate); if (dallowconnections && dallowconnections->arg) @@ -2400,6 +2437,15 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel) /* * Build an updated tuple, perusing the information just obtained */ + if (dstrictunicode) + { + if (dbstrictunicode && !datform->datstrictunicode) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("STRICT_UNICODE cannot be enabled on an existing database"))); + new_record[Anum_pg_database_datstrictunicode - 1] = BoolGetDatum(dbstrictunicode); + new_record_repl[Anum_pg_database_datstrictunicode - 1] = true; + } if (distemplate) { new_record[Anum_pg_database_datistemplate - 1] = BoolGetDatum(dbistemplate); @@ -2695,7 +2741,8 @@ pg_database_collation_actual_version(PG_FUNCTION_ARGS) static bool 
get_db_info(const char *name, LOCKMODE lockmode, Oid *dbIdP, Oid *ownerIdP, - int *encodingP, bool *dbIsTemplateP, bool *dbAllowConnP, bool *dbHasLoginEvtP, + int *encodingP, bool *strictunicodeP, bool *dbIsTemplateP, + bool *dbAllowConnP, bool *dbHasLoginEvtP, TransactionId *dbFrozenXidP, MultiXactId *dbMinMultiP, Oid *dbTablespace, char **dbCollate, char **dbCtype, char **dbIculocale, char **dbIcurules, @@ -2777,6 +2824,9 @@ get_db_info(const char *name, LOCKMODE lockmode, /* character encoding */ if (encodingP) *encodingP = dbform->encoding; + /* reject unassigned code points? (UTF-8 only) */ + if (strictunicodeP) + *strictunicodeP = dbform->datstrictunicode; /* allowed as template? */ if (dbIsTemplateP) *dbIsTemplateP = dbform->datistemplate; diff --git a/src/backend/utils/adt/oracle_compat.c b/src/backend/utils/adt/oracle_compat.c index b126a7d460..d7061f964f 100644 --- a/src/backend/utils/adt/oracle_compat.c +++ b/src/backend/utils/adt/oracle_compat.c @@ -16,11 +16,13 @@ #include "postgres.h" #include "common/int.h" +#include "common/unicode_category.h" #include "mb/pg_wchar.h" #include "miscadmin.h" #include "utils/builtins.h" #include "utils/formatting.h" #include "utils/memutils.h" +#include "utils/pg_locale.h" #include "varatt.h" @@ -1030,6 +1032,7 @@ chr (PG_FUNCTION_ARGS) /* for Unicode we treat the argument as a code point */ int bytes; unsigned char *wch; + pg_unicode_category category; /* * We only allow valid Unicode code points; per RFC3629 that stops at @@ -1042,6 +1045,19 @@ chr (PG_FUNCTION_ARGS) errmsg("requested character too large for encoding: %u", cvalue))); + if (database_strict_unicode) + { + category = unicode_category(cvalue); + if (category == PG_U_UNASSIGNED) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unassigned Unicode code point: %06X", cvalue)); + else if (category == PG_U_SURROGATE) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("Unicode code point is surrogate: %06X", cvalue)); + 
} + if (cvalue > 0xffff) bytes = 4; else if (cvalue > 0x07ff) diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index 79b59b0af7..8ac9a35226 100644 --- a/src/backend/utils/adt/pg_locale.c +++ b/src/backend/utils/adt/pg_locale.c @@ -114,6 +114,9 @@ char *localized_full_days[7 + 1]; char *localized_abbrev_months[12 + 1]; char *localized_full_months[12 + 1]; +/* reject unassigned code points? (UTF-8 only) */ +bool database_strict_unicode = false; + /* is the databases's LC_CTYPE the C locale? */ bool database_ctype_is_c = false; diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index 543afb66e5..e659a54c80 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -138,6 +138,7 @@ static char *text_position_next_internal(char *start_ptr, TextPositionState *sta static char *text_position_get_match_ptr(TextPositionState *state); static int text_position_get_match_pos(TextPositionState *state); static void text_position_cleanup(TextPositionState *state); +static void check_strict_unicode(text *input); static void check_collation_set(Oid collid); static int text_cmp(text *arg1, text *arg2, Oid collid); static bytea *bytea_catenate(bytea *t1, bytea *t2); @@ -200,6 +201,7 @@ cstring_to_text_with_len(const char *s, int len) SET_VARSIZE(result, len + VARHDRSZ); memcpy(VARDATA(result), s, len); + check_strict_unicode(result); return result; } @@ -609,6 +611,7 @@ textrecv(PG_FUNCTION_ARGS) result = cstring_to_text_with_len(str, nbytes); pfree(str); + PG_RETURN_TEXT_P(result); } @@ -6298,6 +6301,38 @@ unicode_assigned(PG_FUNCTION_ARGS) PG_RETURN_BOOL(true); } +static void +check_strict_unicode(text *input) +{ + unsigned char *p; + int size; + + if (!database_strict_unicode) + return; + + Assert(GetDatabaseEncoding() == PG_UTF8); + + /* convert to pg_wchar */ + size = pg_mbstrlen_with_len(VARDATA_ANY(input), VARSIZE_ANY_EXHDR(input)); + p = (unsigned char *) VARDATA_ANY(input); + for (int i = 
0; i < size; i++) + { + pg_wchar code = utf8_to_unicode(p); + int category = unicode_category(code); + + if (category == PG_U_UNASSIGNED) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unassigned Unicode code point: %06X", code)); + else if (category == PG_U_SURROGATE) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("Unicode code point is surrogate: %06X", code)); + + p += pg_utf_mblen(p); + } +} + Datum unicode_normalize_func(PG_FUNCTION_ARGS) { diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 5ffe9bdd98..045e8c07aa 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -401,6 +401,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect SetConfigOption("client_encoding", GetDatabaseEncodingName(), PGC_BACKEND, PGC_S_DYNAMIC_DEFAULT); + database_strict_unicode = dbform->datstrictunicode; + /* assign locale variables */ datum = SysCacheGetAttrNotNull(DATABASEOID, tup, Anum_pg_database_datcollate); collate = TextDatumGetCString(datum); diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index ac409b0006..2418a7ba5b 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -93,6 +93,13 @@ typedef struct _stringlist struct _stringlist *next; } _stringlist; +enum trivalue +{ + TRI_DEFAULT, + TRI_NO, + TRI_YES +}; + static const char *const auth_methods_host[] = { "trust", "reject", "scram-sha-256", "md5", "password", "ident", "radius", #ifdef ENABLE_GSS @@ -149,6 +156,7 @@ static char *icu_locale = NULL; static char *icu_rules = NULL; static const char *default_text_search_config = NULL; static char *username = NULL; +static enum trivalue strict_unicode = TRI_DEFAULT; static bool pwprompt = false; static char *pwfilename = NULL; static char *superuser_password = NULL; @@ -1509,6 +1517,9 @@ bootstrap_template1(void) bki_lines = replace_token(bki_lines, "ENCODING", encodingid_to_string(encodingid)); + 
bki_lines = replace_token(bki_lines, "STRICT_UNICODE", + (strict_unicode == TRI_YES) ? "TRUE" : "FALSE"); + bki_lines = replace_token(bki_lines, "LC_COLLATE", escape_quotes_bki(lc_collate)); @@ -2432,6 +2443,8 @@ usage(const char *progname) printf(_(" --auth-local=METHOD default authentication method for local-socket connections\n")); printf(_(" [-D, --pgdata=]DATADIR location for this database cluster\n")); printf(_(" -E, --encoding=ENCODING set default encoding for new databases\n")); + printf(_(" --no-strict-unicode disable strict unicode\n")); + printf(_(" --strict-unicode enable strict unicode\n")); printf(_(" -g, --allow-group-access allow group read/execute on data directory\n")); printf(_(" --icu-locale=LOCALE set ICU locale ID for new databases\n")); printf(_(" --icu-rules=RULES set additional ICU collation rules for new databases\n")); @@ -3102,6 +3115,8 @@ main(int argc, char *argv[]) {"icu-locale", required_argument, NULL, 16}, {"icu-rules", required_argument, NULL, 17}, {"sync-method", required_argument, NULL, 18}, + {"no-strict-unicode", no_argument, NULL, 19}, + {"strict-unicode", no_argument, NULL, 20}, {NULL, 0, NULL, 0} }; @@ -3286,6 +3301,12 @@ main(int argc, char *argv[]) if (!parse_sync_method(optarg, &sync_method)) exit(1); break; + case 19: + strict_unicode = TRI_NO; + break; + case 20: + strict_unicode = TRI_YES; + break; default: /* getopt_long already emitted a complaint */ pg_log_error_hint("Try \"%s --help\" for more information.", progname); diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 2225a12718..7b028c0be3 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -2981,6 +2981,7 @@ dumpDatabase(Archive *fout) i_datname, i_datdba, i_encoding, + i_datstrictunicode, i_datlocprovider, i_collate, i_ctype, @@ -3000,6 +3001,7 @@ dumpDatabase(Archive *fout) const char *datname, *dba, *encoding, + *datstrictunicode, *datlocprovider, *collate, *ctype, @@ -3035,6 +3037,10 @@ dumpDatabase(Archive *fout) 
appendPQExpBufferStr(dbQry, "daticurules, "); else appendPQExpBufferStr(dbQry, "NULL AS daticurules, "); + if (fout->remoteVersion >= 170000) + appendPQExpBufferStr(dbQry, "datstrictunicode, "); + else + appendPQExpBufferStr(dbQry, "'f' AS datstrictunicode, "); appendPQExpBufferStr(dbQry, "(SELECT spcname FROM pg_tablespace t WHERE t.oid = dattablespace) AS tablespace, " "shobj_description(oid, 'pg_database') AS description " @@ -3048,6 +3054,7 @@ dumpDatabase(Archive *fout) i_datname = PQfnumber(res, "datname"); i_datdba = PQfnumber(res, "datdba"); i_encoding = PQfnumber(res, "encoding"); + i_datstrictunicode = PQfnumber(res, "datstrictunicode"); i_datlocprovider = PQfnumber(res, "datlocprovider"); i_collate = PQfnumber(res, "datcollate"); i_ctype = PQfnumber(res, "datctype"); @@ -3067,6 +3074,7 @@ dumpDatabase(Archive *fout) datname = PQgetvalue(res, 0, i_datname); dba = getRoleName(PQgetvalue(res, 0, i_datdba)); encoding = PQgetvalue(res, 0, i_encoding); + datstrictunicode = PQgetvalue(res, 0, i_datstrictunicode); datlocprovider = PQgetvalue(res, 0, i_datlocprovider); collate = PQgetvalue(res, 0, i_collate); ctype = PQgetvalue(res, 0, i_ctype); @@ -3111,6 +3119,10 @@ dumpDatabase(Archive *fout) appendStringLiteralAH(creaQry, encoding, fout); } + if (strcmp(datstrictunicode, "t") == 0) + appendPQExpBufferStr(creaQry, " STRICT_UNICODE = TRUE"); + else + appendPQExpBufferStr(creaQry, " STRICT_UNICODE = FALSE"); appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = "); if (datlocprovider[0] == 'c') appendPQExpBufferStr(creaQry, "libc"); diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index b6a4eb1d56..0873bddbb6 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -953,6 +953,17 @@ listAllDbs(const char *pattern, bool verbose) appendPQExpBuffer(&buf, " NULL as \"%s\",\n", gettext_noop("ICU Rules")); + if (verbose) + { + if (pset.sversion >= 170000) + appendPQExpBuffer(&buf, + " d.datstrictunicode as \"%s\",\n", + gettext_noop("Strict 
Unicode")); + else + appendPQExpBuffer(&buf, + " 'f' as \"%s\",\n", + gettext_noop("Strict Unicode")); + } appendPQExpBufferStr(&buf, " "); printACLColumn(&buf, "d.datacl"); if (verbose) diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c index 14970a6a5f..3f8f8d27fb 100644 --- a/src/bin/scripts/createdb.c +++ b/src/bin/scripts/createdb.c @@ -42,6 +42,8 @@ main(int argc, char *argv[]) {"locale-provider", required_argument, NULL, 4}, {"icu-locale", required_argument, NULL, 5}, {"icu-rules", required_argument, NULL, 6}, + {"no-strict-unicode", no_argument, NULL, 7}, + {"strict-unicode", no_argument, NULL, 8}, {NULL, 0, NULL, 0} }; @@ -55,6 +57,7 @@ main(int argc, char *argv[]) char *host = NULL; char *port = NULL; char *username = NULL; + enum trivalue strict_unicode = TRI_DEFAULT; enum trivalue prompt_password = TRI_DEFAULT; ConnParams cparams; bool echo = false; @@ -139,6 +142,12 @@ main(int argc, char *argv[]) case 6: icu_rules = pg_strdup(optarg); break; + case 7: + strict_unicode = TRI_NO; + break; + case 8: + strict_unicode = TRI_YES; + break; default: /* getopt_long already emitted a complaint */ pg_log_error_hint("Try \"%s --help\" for more information.", progname); @@ -207,6 +216,12 @@ main(int argc, char *argv[]) appendPQExpBufferStr(&sql, " ENCODING "); appendStringLiteralConn(&sql, encoding, conn); } + if (strict_unicode != TRI_DEFAULT) + { + const char *val = (strict_unicode == TRI_YES) ? 
"TRUE" : "FALSE"; + appendPQExpBufferStr(&sql, " STRICT_UNICODE "); + appendStringLiteralConn(&sql, val, conn); + } if (strategy) appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy)); if (template) diff --git a/src/include/catalog/pg_database.dat b/src/include/catalog/pg_database.dat index 4306e8a3e8..330f11133d 100644 --- a/src/include/catalog/pg_database.dat +++ b/src/include/catalog/pg_database.dat @@ -15,6 +15,7 @@ { oid => '1', oid_symbol => 'Template1DbOid', descr => 'default template for new databases', datname => 'template1', encoding => 'ENCODING', + datstrictunicode => 'STRICT_UNICODE', datlocprovider => 'LOCALE_PROVIDER', datistemplate => 't', datallowconn => 't', dathasloginevt => 'f', datconnlimit => '-1', datfrozenxid => '0', datminmxid => '1', dattablespace => 'pg_default', datcollate => 'LC_COLLATE', diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h index 014baa7bab..21b512818b 100644 --- a/src/include/catalog/pg_database.h +++ b/src/include/catalog/pg_database.h @@ -52,6 +52,9 @@ CATALOG(pg_database,1262,DatabaseRelationId) BKI_SHARED_RELATION BKI_ROWTYPE_OID /* database has login event triggers? */ bool dathasloginevt; + /* reject unassigned code points? (UTF-8 only) */ + bool datstrictunicode BKI_DEFAULT(false); + /* * Max connections allowed. Negative values have special meaning, see * DATCONNLIMIT_* defines below. diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h index 28c925b5af..f48853b98f 100644 --- a/src/include/utils/pg_locale.h +++ b/src/include/utils/pg_locale.h @@ -48,6 +48,9 @@ extern PGDLLIMPORT char *localized_full_days[]; extern PGDLLIMPORT char *localized_abbrev_months[]; extern PGDLLIMPORT char *localized_full_months[]; +/* reject unassigned code points? (UTF-8 only) */ +extern PGDLLIMPORT bool database_strict_unicode; + /* is the databases's LC_CTYPE the C locale? */ extern PGDLLIMPORT bool database_ctype_is_c; -- 2.34.1