Tom Lane wrote:
Another objection to this design is that it's completely unclear that
functions from text to text should necessarily yield the same collation
that went into them, but if you treat collation as a hard-wired part of
the expression syntax tree you aren't going to be able to do anything else.
(What will you do about functions/operators taking more than one text
argument?)

Whatever the spec says. Collation is intimately associated with the comparison operations, and doesn't make any sense anywhere else. The way the default collation for a given operation is determined, by bubbling up the collation from the operands, through function calls and other expressions, is just to make life a bit easier for the developer who's writing the SQL. We could demand that you always explicitly specify a collation when you use the text equality or inequality operators, but because that would be quite tiresome, a reasonable default is derived from the context.
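
To make the "bubbling up" concrete, here is a rough sketch in Python of how a default collation could be derived from an expression tree. The Expr class, its field names, and the precedence rule (an explicit COLLATE beats a collation inherited from a column, and two different collations of equal strength are an error) are assumptions for illustration only; the exact rules would have to be taken from the spec.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Expr:
    """Hypothetical expression node; not an actual PostgreSQL structure."""
    children: Tuple["Expr", ...] = ()
    column_collation: Optional[str] = None    # collation declared on a column reference
    explicit_collation: Optional[str] = None  # set by an explicit COLLATE clause


def derive_collation(expr: Expr) -> Tuple[Optional[str], bool]:
    """Return (collation, is_explicit) for an expression.

    Assumed, simplified rule: an explicit COLLATE wins over anything
    derived from columns; two different collations of the same strength
    are reported as a conflict.
    """
    if expr.explicit_collation is not None:
        return expr.explicit_collation, True
    if expr.column_collation is not None:
        return expr.column_collation, False

    result: Optional[str] = None
    result_explicit = False
    for child in expr.children:
        coll, explicit = derive_collation(child)
        if coll is None:
            continue  # this operand contributes no collation
        if result is None or (explicit and not result_explicit):
            result, result_explicit = coll, explicit
        elif explicit == result_explicit and coll != result:
            raise ValueError(f"collation conflict: {result} vs {coll}")
    return result, result_explicit


# A function call such as upper(name) inherits the collation of its argument.
name = Expr(column_collation="fi_FI")
upper_name = Expr(children=(name,))
print(derive_collation(upper_name))
```

Only the comparison at the top of the tree ever uses the derived collation; the intermediate nodes just pass it along.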

I believe the spec stipulates how that default is derived, so I don't think we need to fret over it. We'll need it eventually, but the parser changes are not the critical part. We can start off by deriving the collation from a GUC variable, for example.

I think it would be better to treat the collation indicator as part of
string *values* and let it bubble up through expressions that way.
The "expr COLLATE ident" syntax would be a simple run-time operation
that pokes a new collation into a string value.  The notion of a column
having a particular collation would then amount to a check constraint on
the values going into the column.

Looking at an individual value, collation just doesn't make sense.
Collation is a property of the comparison operation, not of a value.

In the parser, we might have to do something like that though, because according to the standard you can tack the COLLATE keyword onto string constants and have it bubble up. But let's keep that ugliness just inside the parser.

One impractical way to implement collation would be to have one operator class per collation. In fact you could do that today, with no backend changes, to support multiple collations. It's totally impractical, because for starters you'd need different comparison operators, with different names, for each collation. But it's the right mental model.

I think the right approach is to invent a new concept called "operator modifier". It's basically a 3rd argument to operators. It can be specified explicitly when an operator is used, with syntax like "<left> Op <right> USING <modifier>", or in the case of collation, it's derived from the context, per SQL spec. The operator modifier is tacked on to OpExprs and SortClauses in the parser, and passed as a 3rd argument to the function implementing the operator at execution time.
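
As a toy illustration of the idea, here is a Python sketch where the function behind the "<" operator takes the modifier as a third argument. Everything in it is hypothetical: the COLLATION_KEYS registry, the two made-up collation names, and DEFAULT_COLLATION, which merely stands in for a GUC-style default. A real backend would call into the locale machinery instead of casefold().

```python
from typing import Callable, Dict, List, Optional

# Hypothetical registry: modifier name -> function producing a sort key.
COLLATION_KEYS: Dict[str, Callable[[str], str]] = {
    "codepoint": lambda s: s,                     # plain code point order
    "case_insensitive": lambda s: s.casefold(),   # ignore case differences
}

# Stand-in for a default taken from a GUC variable (an assumption).
DEFAULT_COLLATION = "codepoint"


def text_lt(left: str, right: str, modifier: Optional[str] = None) -> bool:
    """The function behind '<': the modifier arrives as a third argument."""
    key = COLLATION_KEYS[modifier or DEFAULT_COLLATION]
    return key(left) < key(right)


def sort_text(values: List[str], modifier: Optional[str] = None) -> List[str]:
    """A sort clause carries the same modifier as the comparison would."""
    return sorted(values, key=COLLATION_KEYS[modifier or DEFAULT_COLLATION])


# Same operator, same operands, different modifier, different answer.
print(text_lt("B", "a", "codepoint"))
print(text_lt("B", "a", "case_insensitive"))
```

The point is that there is a single operator and a single implementing function; only the modifier varies, which is what separates this from having one operator class per collation.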

When an index is created, if the operators in the operator class take an operator modifier, it's stored at creation time into a new column in pg_index (needs to be a vector or array to handle multi-column indexes). The planner needs to check the modifier when it determines whether an index can be used or not.
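
The planner check could look roughly like the following Python sketch. IndexDef and index_usable are invented names, and the "modifiers" tuple stands in for the proposed new pg_index column, with one entry per index column; none of this is existing catalog structure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IndexDef:
    """Hypothetical stand-in for a pg_index row."""
    name: str
    columns: Tuple[str, ...]
    # One modifier per index column, stored at index creation time.
    # None means the operator class takes no modifier for that column.
    modifiers: Tuple[Optional[str], ...]


def index_usable(index: IndexDef, column: str, modifier: Optional[str]) -> bool:
    """True if the index covers the column with the same operator modifier."""
    for col, mod in zip(index.columns, index.modifiers):
        if col == column:
            return mod == modifier
    return False  # the column is not part of this index


idx = IndexDef("people_name_idx", ("last_name", "first_name"), ("fi_FI", "fi_FI"))
print(index_usable(idx, "last_name", "fi_FI"))
print(index_usable(idx, "last_name", "en_US"))
```

An index built under one collation is simply skipped for a comparison that uses another, the same way an index with the wrong operator class would be.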

BTW, this reminds me of the discussions we had about the tsearch default configuration. It's different, though, because in full text search, there's a separate tsvector data type, and the problem was with expression indexes, not regular ones.

Another consideration is LC_CTYPE. Just like we want to support
different collations, we should support different character
classifications for upper()/lower(). We might want to tie it into
collation, as using different ctype and collation doesn't usually make
sense, but it's something to keep in mind.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

