Re: Hash join not finding which collation to use for string hashing

Mark Dilger Thu, 30 Jan 2020 11:44:18 -0800

>> Would it make sense to catch a qual with unassigned collation
>> somewhere in the planner, where the qual's operator family is
>> estatblished, by checking if the operator family behavior is sensitive
>> to collations?

I’m not sure how to do that.  pg_opfamily doesn’t seem to have a field for 
that.  Can you recommend how I would proceed there?



There may be operators other than = and != that are worth thinking about, but 
for now I’m really only thinking about those two.  It is hard for me to see how 
we could ignore a collation mismatch for any other operator, but maybe I’m just 
not thinking about it the right way.

So, for = and !=, I’m looking at the definition of texteq, and it calls 
check_collation_set as one of the very first things it does.  That’s where the 
error that annoys me comes out.  But I don’t think it really needs to be doing 
this.  It should first be determining if collation *matters*.  It can’t do that 
right now, because it gets that information from this line:

>     if (lc_collate_is_c(collid) ||
>         collid == DEFAULT_COLLATION_OID ||
>         pg_newlocale_from_collation(collid)->deterministic)

Which obviously won’t work if collid hasn’t been set.  So three approaches come 
to my mind:

1) Somewhere upstream from calling texteq, figure out that we don’t actually 
care about the collation stuff, because both the left and right side of the 
comparison use deterministic collations and the comparison we’re calling is 
equality, and then pass down a dummy collation such as “C” even though that 
isn’t actually true.

The problem with (1) that I see is that it could lead to the wrong collation 
being mentioned in error messages, though I haven’t looked, and that it’s 
enough of a hack that it might make coding in this area harder in the future.

2) Somewhere upstream from calling texteq, pass in a new boolean flag that 
specifies whether collation matters, and extend texteq to take an additional 
argument.

This also seems very hacky to me, but for different reasons.

3) Extend the concept of collations to collation sets.  Right now, I’m only 
thinking about a collation set as having two values, the lefthand and the 
righthand side, but maybe there are other cases like (Left, (Left,Right)) that 
get built up and need to work.  Anyway, at the point in the executor that the 
collations don’t match, instead of passing NULL down the line, pass in a 
collation set (Left, Right), and functions like texteq can see that they’re 
dealing with two different collations and decide if they can deal with that or 
if they need to throw an error.

I bet if we went with (3), the error being thrown in the example I used to 
start this thread would go away, without breaking anything else.  I’m going to 
go poke at that a bit, but I’d still appreciate any comments/concerns about my 
analysis.

> According to Tom:
> 
>> (BTW, before v12 the text '=' operator indeed did not care for collation,
>> so this example would've worked.  But the change in behavior is a
>> necessary consequence of having invented nondeterministic collations,
>> not a bug.)
> 
> I’m still struggling with that, because the four collations I used in the 
> example are all deterministic.  I totally understand why having more than one 
> collation matters if you ask that your data be in sorted order, as the system 
> needs to know which ordering to use.  But for equality, I would think that 
> deterministic collations are all interchangeable, because they all agree on 
> whether A = B, regardless of the collation defined on column A and/or on 
> column B.  Maybe I’m wrong about that.  But that’s my reading of the 
> definition of “deterministic collation” given in the docs:
> 
>> A deterministic collation uses deterministic comparisons, which means that 
>> it considers strings to be equal only if they consist of the same byte 
>> sequence. 
> 
> I’m reading that as “If and only if”, and maybe I’m wrong to do so.  Maybe 
> that’s my error.  But assuming that part is ok, it would seem to be 
> sufficient to know that the columns being joined use deterministic 
> collations, and you wouldn’t need them to be the *same* collations, nor even 
> remember which collations they were.  You’d just need information passed down 
> that collations can be ignored for this comparison, or that a built-in 
> byte-for-byte equality comparator should be used rather than the collation’s 
> equality comparator, or some such solution.

I’m starting to think that “consequence of having invented nondeterministic 
collations” in Tom’s message really should read “consequence of having invented 
nondeterministic collations without reworking these other interfaces”, but once 
again, I’m hoping to be corrected if I’ve gone off in the wrong direction here.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Hash join not finding which collation to use for string hashing

Reply via email to