On 2019-02-21 03:17, Peter Geoghegan wrote: > I wonder if it would be better to break this into distinct commits?
I thought about that. Especially the planner/executor changes could be done separately, sort of as a way to address the thread "ExecBuildGroupingEqual versus collations". But I'm not sure if they would have good test coverage on their own. I can work on this if people think this would be useful. > * Why is support for non-deterministic comparisons limited to the ICU > provider? If that's the only case that could possibly be affected, > then why did we ever add the varstrcmp() tie-breaker call to strcmp() > in the first place? The behavior described in the commit message of > bugfix commit 656beff5 describes a case where Hungarian text caused > index corruption by being strcoll()-wise equal but not bitwise equal. > Besides, I think that you can vendor your own case insensitive > collation with glibc, since it's based on UCA. The original test case (described here: https://www.postgresql.org/message-id/27064.1134753128%40sss.pgh.pa.us) no longer works, so it was probably fixed on the glibc side. The git log of the hu_HU locale definition shows that it has been "fixed" in major ways several times over the years, so that's plausible. I tried reproducing some more practical scenarios involving combining diacritics, but glibc apparently doesn't believe strings in different normal forms are equal. At that point I gave up because this doesn't seem worthwhile to support. Moreover, I think allowing this would require a "trusted" strxfrm(), which is currently disabled. >> + return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo, >> + DEFAULT_COLLATION_OID, >> + oldvalue, newvalue)); >> } > > The existing restriction on ICU collations (that they cannot be > database-wide) side-steps the issue. > > * Can said restriction somehow be lifted? That seems like it would be > a lot cleaner. Lift what restriction? That ICU collations cannot be database-wide? Sure that would be nice, but it's a separate project. Even then, I'm not sure that we would allow a database-wide collation to be nondeterministic. That would for example disallow the use of LIKE, which would be weird. In any case, the above issue can be addressed then. I think it's not worth complicating this right now. > * Why have you disable this optimization?: > >> /* Fast pre-check for equality, as discussed in varstr_cmp() */ >> - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0) >> + if ((!sss->locale || sss->locale->deterministic) && >> + len1 == len2 && memcmp(a1p, a2p, len1) == 0) > > I don't see why this is necessary. A non-deterministic collation > cannot indicate that bitwise identical strings are non-equal. Right, I went too far there. > * Perhaps you should add a "Tip" referencing the feature to the > contrib/citext documentation. Good idea. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services