Re: Hash join not finding which collation to use for string hashing

Mark Dilger Thu, 30 Jan 2020 13:16:19 -0800

> On Jan 30, 2020, at 12:29 PM, Tom Lane <[email protected]> wrote:
> 
> Mark Dilger <[email protected]> writes:
>> Would it make sense to catch a qual with unassigned collation
>> somewhere in the planner, where the qual's operator family is
>> estatblished, by checking if the operator family behavior is sensitive
>> to collations?
> 
>> I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for 
>> that.  Can you recommend how I would proceed there?
> 
> There's no such information attached to opfamilies, which is more or less
> forced by the fact that individual operators don't expose it either.
> There's not much hope of making that better without incompatible changes
> in the requirements for extensions to define operators and/or operator
> families.

Thanks, Tom, for confirming this.

Given the excellent explanations you and Robert have given, I think I’m 
retracting this whole idea and accepting your positions that it’s not worth it.

For the archives, I’m still going to respond to the rest of what you say:

>> So, for = and !=, I’m looking at the definition of texteq, and it calls 
>> check_collation_set as one of the very first things it does.  That’s where 
>> the error that annoys me comes out.  But I don’t think it really needs to be 
>> doing this.  It should first be determining if collation *matters*.
> 
> But of course it matters.  How do you know whether the operation is
> deterministic if you don't know the collation?
> 
>> 3) Extend the concept of collations to collation sets.  Right now, I’m only 
>> thinking about a collation set as having two values, the lefthand and the 
>> righthand side, but maybe there are other cases like (Left, (Left,Right)) 
>> that get built up and need to work.  Anyway, at the point in the executor 
>> that the collations don’t match, instead of passing NULL down the line, pass 
>> in a collation set (Left, Right), and functions like texteq can see that 
>> they’re dealing with two different collations and decide if they can deal 
>> with that or if they need to throw an error.
> 
> Maybe this could work.  I think it would get messy when bubbling up
> collations, but as long as you're talking about "sets" not "pairs"
> it might be possible to postpone collation resolution.
> 
> To me, though, the main advantage of this is that we could throw a
> more explicit error like "collations "ja_JP" and "tr_TR" cannot be
> unified", since that information would still be there at runtime.
> I'm still pretty dubious that having texteq special-case the situation
> where the collations are different but all deterministic is a reasonable
> thing to do.

On my mac, when I run “SELECT * FROM pg_collation”, every one of the 271 rows I 
get back have collisdeterministic true. I know that which collations you get on 
a system is variable, so I’m not saying that nobody has nondeterministic 
collations, but it seems common enough that mismatched collations will both be 
deterministic.  That’s the common case, not some weird edge case.

So the issue here seems to be whether equality should get different treatment 
from other operators, and I obviously am arguing that it should, though you and 
Robert have both made really good points against that position.

> 
> One practical problem is that postponing that work to runtime could be
> a huge performance hit, because you'd have to do it over again on each
> call of the operator.  I suppose some caching might be possible.

Yes, Robert mentioned performance implications, too.

> Another issue is that you're still putting far too much emphasis on
> the fact that a hash-join plan manages to avoid this error, and ignoring
> the problem that a lot of other plans for the same query will not avoid
> it.  What if the planner had chosen a merge-join, for instance?  

You’re looking at the problem from the point of view of how postgres is 
currently and historically implemented, and seeing that this problem is hard.  
I was looking at it more from the perspective of a user who gets the error 
message and thinks, “this is stupid, the query is refusing to run for want of a 
collation being specified, but I can clearly see that it doesn’t actually need 
one.”  I think that same reaction from the user would happen if the planner 
chose a merge-join.  The user would just say, “gee, what a stupid planner, why 
did it choose a merge join for this when the lack of a collation clause clearly 
indicates that it should have limited itself to something that only needs 
equality comparison.”

I’m not calling the planner stupid, nor the system generally, but I know that 
people get frustrated with systems that have unintuitive limitations like this 
when they don’t know the internals of the system that give rise to the 
limitations.

> How
> useful is it to allow the join if things still break the moment you
> add an ORDER BY?

I think that’s apples-to-oranges.  If you ask the system to order the data, and 
you’ve got ambiguity about which ordering you mean, then of course it can’t 
continue until you tell it which collation you want.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Hash join not finding which collation to use for string hashing

Reply via email to