On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
> any byte sequence at all is accepted when you try to
> put values into the database.

We support SQL_ASCII, which allows something similar.

> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned, or aren't yet known to the collation
> library, that's another way for data loading to fail.

A failure during data loading is either a feature or a bug, depending
on whether you are the one loading the data or the one trying to make
sense of it later ;-)

> Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yeah, there are many reasons someone might want to accept unassigned
code points. An obvious one is if their application is on a newer
version of Unicode where the code point *is* assigned.

> The fact that there are multiple types of normalization and multiple
> notions of equality doesn't make this easier.

NFC is really the only one that makes sense. NFD is semantically the
same as NFC, but expanded into a larger representation. NFKC/NFKD are
based on a more relaxed notion of equality -- kind of like
non-deterministic collations. These other forms might make sense in
certain cases, but not general use.

I believe that having a kind of text data type where it's stored in NFC
and compared with memcmp() would be a good place for many users to be
-- probably most users. It's got all the performance and stability
benefits of memcmp(), with slightly richer semantics. It's less likely
that someone malicious can confuse the database by using different
representations of the same character.

The problem is that it's not universally better for everyone: there are
certainly users who would prefer that the code points they send to the
database are preserved exactly, and also users who would like to be
able to use unassigned code points.

Regards,
    Jeff Davis
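
For illustration only, here is a minimal sketch of the "store in NFC,
compare with memcmp()" idea discussed above. It is written in Python
with the standard unicodedata module rather than in PostgreSQL's C, the
helper name nfc_bytes is hypothetical, and plain byte equality on the
UTF-8 encoding stands in for memcmp(). It is not a proposal for how the
data type would actually be implemented.

```python
import unicodedata


def nfc_bytes(s: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", s).encode("utf-8")


composed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Without normalization, the two spellings differ byte-for-byte.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# After NFC normalization, a byte comparison treats them as equal.
nfc_equal = nfc_bytes(composed) == nfc_bytes(decomposed)

# NFD carries the same meaning but in a larger representation.
nfd_len = len(unicodedata.normalize("NFD", composed).encode("utf-8"))
nfc_len = len(nfc_bytes(composed))

# NFKC uses a more relaxed notion of equality: the "fi" ligature
# (U+FB01) folds to the two letters "fi", while NFC leaves it alone.
ligature = "\ufb01"
nfkc_folds = unicodedata.normalize("NFKC", ligature) == "fi"
nfc_keeps = unicodedata.normalize("NFC", ligature) == ligature

print(raw_equal, nfc_equal, nfd_len > nfc_len, nfkc_folds, nfc_keeps)
```

The trade-off mentioned above is visible here too: once the input is
normalized, the original sequence of code points (the decomposed
spelling) is no longer recoverable from the stored bytes.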