The SQL standard defines several standard collations. Most of them are only of legacy interest (IMO), but two are currently relevant: UNICODE and UCS_BASIC. UNICODE sorts by the default Unicode collation algorithm specifications and UCS_BASIC sorts by codepoint.

When collation support was added to PostgreSQL, we added UCS_BASIC, since that could easily be mapped to the C locale. But there was no straightforward way to provide the UNICODE collation. (Recall that collation support came several releases before ICU support.)

With ICU support, we can provide the UNICODE collation, since it's just the root locale. I suppose one hesitation was that ICU was not a standard feature, so this would create variations in the default catalog contents, or something like that. But I think now that we are drifting to make ICU more prominent, we can just add that anyway. I think being able to say

    COLLATE UNICODE

instead of

    COLLATE "und-x-icu"

or whatever it is, is pretty useful.

So, attached is a small patch to add this.
From 98fce30e4997253e5b010b6ac72d66255bc77bf6 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <pe...@eisentraut.org>
Date: Wed, 1 Mar 2023 10:38:19 +0100
Subject: [PATCH] Add standard collation UNICODE

---
 doc/src/sgml/charset.sgml | 30 +++++++++++++++++++++++++++---
 src/bin/initdb/initdb.c   | 10 +++++++---
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 3032392b80..13ec238a81 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -659,9 +659,33 @@ <title>Standard Collations</title>
    </para>
 
    <para>
-    Additionally, the SQL standard collation name <literal>ucs_basic</literal>
-    is available for encoding <literal>UTF8</literal>.  It is equivalent
-    to <literal>C</literal> and sorts by Unicode code point.
+    Additionally, two SQL standard collation names are available for encoding
+    <literal>UTF8</literal>:
+
+    <variablelist>
+     <varlistentry>
+      <term><literal>unicode</literal></term>
+      <listitem>
+       <para>
+        This collation sorts using the Unicode Collation Algorithm with the
+        Default Unicode Collation Element Table.  (This is the same behavior
+        as the ICU root locale; see <xref
+        linkend="collation-managing-predefined-icu-und-x-icu"/>.)  This
+        collation is only available when ICU support is configured.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><literal>ucs_basic</literal></term>
+      <listitem>
+       <para>
+        This collation sorts by Unicode code point.  (This is the same
+        behavior as the libc locale specification <literal>C</literal>.)
+       </para>
+      </listitem>
+     </varlistentry>
+    </variablelist>
    </para>
   </sect3>
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..525bec4b44 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1482,10 +1482,14 @@ static void
 setup_collation(FILE *cmdfd)
 {
        /*
-        * Add an SQL-standard name.  We don't want to pin this, so it doesn't 
go
-        * in pg_collation.h.  But add it before reading system collations, so
-        * that it wins if libc defines a locale named ucs_basic.
+        * Add SQL-standard names.  We don't want to pin these, so they don't go
+        * in pg_collation.h.  But add them before reading system collations, so
+        * that they win if libc defines a locale with the same name.
         */
+       PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
+                                 "VALUES 
(pg_nextoid('pg_catalog.pg_collation', 'oid', 
'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, 
%u, '%c', true, %d, 'und');\n\n",
+                                 BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU, 
PG_UTF8);
+
        PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, 
collowner, collprovider, collisdeterministic, collencoding, collcollate, 
collctype)"
                                  "VALUES 
(pg_nextoid('pg_catalog.pg_collation', 'oid', 
'pg_catalog.pg_collation_oid_index'), 'ucs_basic', 'pg_catalog'::regnamespace, 
%u, '%c', true, %d, 'C', 'C');\n\n",
                                  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_LIBC, 
PG_UTF8);
-- 
2.39.2

Reply via email to