We use system UTF-16 collation to implement UTF-8 collation on Windows. The PostgreSQL security team received a report, from Timothy Kuun, that this collation does not uphold the "symmetric law" and "transitive law" that we require for btree operator classes. The attached test program demonstrates this. http://www.delphigroups.info/2/62/478610.html quotes reports of that problem going back eighteen years. Most code points are unaffected. Indexing an affected code point using such a collation can cause btree index scans to not find a row they should find and can make a UNIQUE or PRIMARY KEY constraint admit a duplicate. The security team determined that this doesn't qualify as a security vulnerability, but it's still a bug.
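For readers who want the two laws spelled out, here is a minimal, portable sketch of how one might check them for any string comparator. It is not the attached program and does not touch the Windows collation APIs. The names cmp_fn, count_symmetry_violations, count_transitivity_violations, and cyclic_cmp are hypothetical, introduced only for illustration. cyclic_cmp is a deliberately broken rock-paper-scissors comparator that stands in for a collation with a cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical comparator type: negative, zero, or positive, like wcscoll(). */
typedef int (*cmp_fn)(const wchar_t *, const wchar_t *);

static int sign(int x) { return (x > 0) - (x < 0); }

/*
 * Symmetric law: cmp(a, b) and cmp(b, a) must have opposite signs, and
 * cmp(a, a) must be zero.  Returns the number of pairs (i <= j) violating it.
 */
size_t count_symmetry_violations(cmp_fn cmp, const wchar_t *const *s, size_t n)
{
    size_t bad = 0;

    assert(cmp != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i; j < n; j++)
            if (sign(cmp(s[i], s[j])) != -sign(cmp(s[j], s[i])))
                bad++;
    return bad;
}

/*
 * Transitive law (strict form): a < b and b < c must imply a < c.
 * Returns the number of ordered triples violating it.
 */
size_t count_transitivity_violations(cmp_fn cmp, const wchar_t *const *s, size_t n)
{
    size_t bad = 0;

    assert(cmp != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                if (cmp(s[i], s[j]) < 0 && cmp(s[j], s[k]) < 0 &&
                    !(cmp(s[i], s[k]) < 0))
                    bad++;
    return bad;
}

/* Illustration only: maps the first character to a rank in a 3-cycle. */
static int rank_of(wchar_t c)
{
    return c == L'r' ? 0 : c == L'p' ? 1 : 2;
}

/* Deliberately cyclic comparator: r < p, p < s, and s < r. */
int cyclic_cmp(const wchar_t *a, const wchar_t *b)
{
    int d = (rank_of(b[0]) - rank_of(a[0]) + 3) % 3;

    return d == 0 ? 0 : (d == 1 ? -1 : 1);
}
```

Passing wcscmp as the comparator should report no violations for any input, since code-unit comparison is a total order. On Windows, one could presumably pass wcscoll instead, with s1, s2, and s3 from the attached program as the input strings, to look for the reported misbehavior. I have not verified that combination with this sketch.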
All I can think to do is issue a warning whenever a CREATE DATABASE or CREATE COLLATION combines UTF8 encoding with a locale having this problem. In a greenfield project, I would forbid affected combinations of encoding and locale. For us, that is too harsh, considering how few code points are affected and how difficult it is to change the collation of an existing database. For CREATE DATABASE, every locale except LOCALE=C would trigger the warning. For CREATE COLLATION, ICU locales would be exempt as well. Hence, the chief workaround is to use LOCALE=C at the database level and ICU collations for indexes and operator invocations. (The ability to use an ICU collation at the database level would improve the user experience here.) Better ideas?
#include <errno.h>
#include <locale.h>
#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <winnls.h>

/* Locale object for _wcscoll_l(); NULL means "use wcscoll() instead". */
_locale_t glocale;

/* Print the UTF-16 code units of a string in hex. */
void
p(const wchar_t *bp)
{
	while (*bp)
		printf("%hx ", *bp++);
	puts("");
}

/*
 * Compare two strings with the collation under test.  Returns the raw
 * comparison result as an int; returning char would truncate the error
 * value that wcscoll()/_wcscoll_l() can return.
 */
int
lt(const wchar_t *a, const wchar_t *b)
{
	int			result;

	errno = 0;
	result = glocale ? _wcscoll_l(a, b, glocale) : wcscoll(a, b);
	if (errno != 0)
		puts("wcscoll(_l) failed");
	return result;
}

/* Called once per system locale by EnumSystemLocalesEx(). */
BOOL CALLBACK
cb(LPWSTR locale, DWORD flags, LPARAM unused)
{
	/* Three orderings of the same three code units. */
	wchar_t		s1[] = {0x11a7, 0x1188, 0xd7a2, 0x0};
	wchar_t		s2[] = {0x11a7, 0xd7a2, 0x1188, 0x0};
	wchar_t		s3[] = {0xd7a2, 0x11a7, 0x1188, 0x0};

	p(s1);
	p(s2);
	p(s3);

	if (_wsetlocale(LC_ALL, locale))
	{
		/* int ordinary = */
		/* cmp(s1, s3) < 0 && cmp(s3, s4) < 0 && cmp(s1, s4) < 0; */
		/* verdict = ordinary ? "typical" : "unusual"; */

		/* First pass: wcscoll() under the global locale. */
		if (glocale)
			_free_locale(glocale);
		glocale = NULL;
		printf("%S: %d %d %d\n", locale,
			   lt(s1, s2), lt(s2, s3), lt(s3, s1));

		/* Second pass: _wcscoll_l() with an explicit locale object. */
		glocale = _wcreate_locale(LC_ALL, locale);
		if (!glocale)
			puts("_wcreate_locale failed");
		printf("%S: %d %d %d\n", locale,
			   lt(s1, s2), lt(s2, s3), lt(s3, s1));
	}
	else
		printf("%S: setlocale failed\n", locale);

	return TRUE;
}

int
main(int argc, char **argv)
{
	EnumSystemLocalesEx(cb, 0, (LPARAM) NULL, NULL);
	return 0;
}