Re: Collation versioning

Julien Rouhaud Thu, 12 Dec 2019 06:37:20 -0800

Hello Thomas,

Thanks for looking at it!


On Thu, Dec 12, 2019 at 5:01 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> We create duplicate pg_depend records:
>
> [...]
>
> I wondered if that was harmless

That's the assumed behavior of recordMultipleDependencies:

/*
* Record the Dependency.  Note we don't bother to check for
* duplicate dependencies; there's no harm in them.
*/

We could add a check to skip duplicates for the "track_version ==
true" path, or switch to flags if we want to also skip duplicates in
other cases, but it'll make recordMultipleDependencies a little bit
more specialised.

> but for one thing it causes duplicate warnings:

Yes, that should be avoided.

> Here's another way to get a duplicate, and in this example you also
> get an unnecessary dependency on 100 "default" for this index:
>
> postgres=# create index on t(x collate "fr_FR") where x > 'helicopter'
> collate "fr_FR";
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
>     1259 | 16460 |        0 |       3456 |    12603 |           0 |
> 0:34.0        | n
>     1259 | 16460 |        0 |       3456 |    12603 |           0 |
> 0:34.0        | n
>     1259 | 16460 |        0 |       3456 |      100 |           0 |
> 0:34.0        | n
> (3 rows)
>
> Or... maybe 100 should be there, by simple analysis of the x in the
> WHERE clause, but it's the same if you write x collate "fr_FR" >
> 'helicopter' collate "fr_FR", and in that case there are no
> expressions of collation "default" anywhere.

Ah good point.  That's because expression_tree_walker() will dig into
CollateExpr->args and eventually reach the underlying Var.  I don't
see an easy way to avoid that while still properly recording the
required dependency for an even more realistic index such as

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" >
'helicopter' COLLATE "en_US")::text) collate "fr_FR";

and for instance not for

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" ||
'helicopter' COLLATE "en_US")) collate "fr_FR";


> The indirection through composite types works nicely:
>
> postgres=# create type foo_t as (en text collate "en_CA", fr text
> collate "fr_CA");
> CREATE TYPE
> postgres=# create table t (foo foo_t);
> CREATE TABLE
> postgres=# create index on t(foo);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_foo_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
>     1259 | 16444 |        0 |       3456 |    12554 |           0 |
> 0:34.0        | n
>     1259 | 16444 |        0 |       3456 |    12597 |           0 |
> 0:34.0        | n
> (2 rows)
>
> ... but again it shows the extra and technically unnecessary
> dependencies (only 12603 "fr_FR" is really needed):
>
> postgres=# create index on t(((foo).fr collate "fr_FR"));
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_fr_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
>     1259 | 16445 |        0 |       3456 |    12603 |           0 |
> 0:34.0        | n
>     1259 | 16445 |        0 |       3456 |    12597 |           0 |
> 0:34.0        | n
>     1259 | 16445 |        0 |       3456 |    12554 |           0 |
> 0:34.0        | n
> (3 rows)

Yes :(

> I check that nested types are examined recursively, as appropriate.  I
> also tested domains, arrays, arrays of domains, expressions extracting
> an element from an array of a domain with an explicit collation, and
> the only problem I could find was more ways to get duplicates.  Hmm...
> what else is there that can contain a collatable type...?  Ranges!
>
> postgres=# create type myrange as range (subtype = text);
> CREATE TYPE
> postgres=# drop table t;
> DROP TABLE
> postgres=# create table t (x myrange);
> CREATE TABLE
> postgres=# create index on t(x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> (0 rows)
>
> ... or perhaps, more realistically, a GIST index might actually be
> useful for range queries, and we're not capturing the dependency:
>
> postgres=# create index t_x_idx on t using gist (x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> ---------+-------+----------+------------+----------+-------------+---------------+---------
> (0 rows)

Good catch :) I fixed it locally and checked that a gist index on a
range with a subtype being a composite type does record the required
dependencies.

> The new syntax "ALTER INDEX i_name DEPENDS ON ANY COLLATION UNKNOWN
> VERSION" doesn't sound good to me, it's not "ANY" collation, it's a
> specific set of collations that we aren't listing.  "ALTER INDEX
> i_name DEPENDS ON COLLATION * VERSION UNKNOWN", hrmph, no that's
> terrible... I'm not sure what would be better.

Mmm, indeed.  With a 3rd round in the existing keyword, how about
"DEPENDS ON [ ANY ] REFERENCING COLLATION"?  The ANY is mostly to
avoid the need for plural.

> I'm not sure if I like the idea of VACUUM reporting warnings or not.  Hmm.

Even if I add this in a IsAutoVacuumWorkerProcess?

> To state more explicitly what's happening here, we're searching the
> expression trees for subexpresions that have a collation as part of
> their static type.  We don't know which functions or operators are
> actually affected by the collation, though.  For example, if an
> expression says "x IS NOT NULL" and x happens to be a subexpression of
> a type with a particular collation, we don't now that this
> expression's value can't possibly be affected by the collation version
> changing.  So, the system will nag you to rebuild an index just
> because you mentioned it, even though the index can't be corrupted.
> To do better than that, I suppose we'd need declarations in the
> catalog to say which functions/operators are collation sensitive.

Wouldn't that still be a problem for an absurd expression like

WHERE length((val collate "en_US" > 'uh' collate "en_US")::text) > 0


And since we would still have to record a dependency on the collation
in such case, we would need to have another magic value to distinguish
"unknown" from "cannot cause corruption" collation version.

Re: Collation versioning

Reply via email to