Re: insensitive collations

Peter Eisentraut Thu, 21 Feb 2019 00:36:45 -0800

On 2019-02-21 03:17, Peter Geoghegan wrote:
> I wonder if it would be better to break this into distinct commits?


I thought about that.  Especially the planner/executor changes could be
done separately, sort of as a way to address the thread
"ExecBuildGroupingEqual versus collations".  But I'm not sure if they
would have good test coverage on their own.  I can work on this if
people think this would be useful.

> * Why is support for non-deterministic comparisons limited to the ICU
> provider? If that's the only case that could possibly be affected,
> then why did we ever add the varstrcmp() tie-breaker call to strcmp()
> in the first place? The behavior described in the commit message of
> bugfix commit 656beff5 describes a case where Hungarian text caused
> index corruption by being strcoll()-wise equal but not bitwise equal.
> Besides, I think that you can vendor your own case insensitive
> collation with glibc, since it's based on UCA.

The original test case (described here:
https://www.postgresql.org/message-id/27064.1134753128%40sss.pgh.pa.us)
no longer works, so it was probably fixed on the glibc side.  The git
log of the hu_HU locale definition shows that it has been "fixed" in
major ways several times over the years, so that's plausible.

I tried reproducing some more practical scenarios involving combining
diacritics, but glibc apparently doesn't believe strings in different
normal forms are equal.  At that point I gave up because this doesn't
seem worthwhile to support.

Moreover, I think allowing this would require a "trusted" strxfrm(),
which is currently disabled.

>> +   return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
>> +                                         DEFAULT_COLLATION_OID,
>> +                                         oldvalue, newvalue));
>>  }
> 
> The existing restriction on ICU collations (that they cannot be
> database-wide) side-steps the issue.
> 
> * Can said restriction somehow be lifted? That seems like it would be
> a lot cleaner.

Lift what restriction?  That ICU collations cannot be database-wide?
Sure that would be nice, but it's a separate project.  Even then, I'm
not sure that we would allow a database-wide collation to be
nondeterministic.  That would for example disallow the use of LIKE,
which would be weird.  In any case, the above issue can be addressed
then.  I think it's not worth complicating this right now.

> * Why have you disable this optimization?:
> 
>>     /* Fast pre-check for equality, as discussed in varstr_cmp() */
>> -   if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>> +   if ((!sss->locale || sss->locale->deterministic) &&
>> +       len1 == len2 && memcmp(a1p, a2p, len1) == 0)
> 
> I don't see why this is necessary. A non-deterministic collation
> cannot indicate that bitwise identical strings are non-equal.

Right, I went too far there.

> * Perhaps you should add a "Tip" referencing the feature to the
> contrib/citext documentation.

Good idea.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: insensitive collations

Reply via email to