Re: Why do indexes and sorts use the database collation?

Jeff Davis Mon, 13 Nov 2023 17:59:10 -0800

On Mon, 2023-11-13 at 22:36 +0100, Tomas Vondra wrote:
> Yeah. I don't quite agree with the initial argument that not
> specifying
> the collation explicitly in CREATE TABLE or a query means the user
> does
> not care about the collation.


I didn't argue that the user doesn't care about collation -- we need to
honor the collation semantics of the column. And a column with
unspecified collation must be the database collation (otherwise what
would the database collation mean?). But the index's collation is an
implementation detail that is not necessary to provide the requested
semantics.

I'm arguing that pathkeys are often not even useful for providing the
requested semantics, so why should the user have the pain of poor
performance and versioning risks for every text index in their system?
If the user just wants PK/FK constraints, and equality lookups, then an
index with the "C" collation makes a lot of sense to serve those
purposes.

> For example, I don't see how we could arbitrarily override the
> collation
> for indexes backing primary keys, because how would you know the user
> will never do a sort on it?

The column collation and index collation are tracked separately in the
catalog. The column collation cannot be overridden because it's
semantically signficant, but there are at least some degrees of freedom
we have with the index collation.

I don't think we can completely change the default index collation to
be "C", but perhaps there could be a database-level option to do so,
and that would have no effect on semantics at all. If the user notices
some queries that could benefit from an index with a non-"C" collation,
they can add/replace an index as they see fit.

>  Not that uncommon with natural primary keys,
> I think (not a great practice, but people do that).

Natural keys often have an uncorrelated index, and if the index is not
correlated, it's often not useful ORDER BY.

When I actually think about schemas and plans I've seen in the wild, I
struggle to think of many cases that would really benefit from an index
in a non-"C" collation. The best cases I can think of are where it's
doing some kind of prefix search. That's not rare, but it's also not so
common that I'd like to risk index corruption on every index in the
system by default in case a prefix search is performed.

> Perhaps we could allow the PK index to have a different collation,
> say
> by supporting something like this:
> 
>   ALTER TABLE distributors ADD PRIMARY KEY (dist_id COLLATE "C");

Yes, I'd like something like that to be supported. We'd have to check
that, if the collations are different, that both are deterministic.

> And then the planner would just pick the right index, I think.

Right now the planner doesn't seem to understand that an index in the
"C" collation works just fine for answering equality queries. That
should be fixed.

> If the
> user cares about ordering, they'll specify ORDER BY with either an
> explicit or the default collation. If the index collation matches, it
> may be useful for the ordering.

Exactly.

> Of course, if we feel entitled to create the primary key index with a
> collation of our choosing, that'd make this unpredictable.

I wouldn't describe it as "unpredictable". We'd have some defined way
of defaulting the collation of an index which might be affected by a
database option or something. In any case, it would be visible with \d.

Regards,
        Jeff Davis

>

Re: Why do indexes and sorts use the database collation?

Reply via email to