neilconway commented on issue #15161:
URL: https://github.com/apache/datafusion/issues/15161#issuecomment-3936788737
Alright, I spent some more time on this; I have PRs open for two possible
approaches:
1. #20426
2. #20456
Approach (1) is what I was initially planning. We prefer string -> numeric
coercions in comparison contexts, and numeric -> string coercions in contexts
like UNION and CASE. That fixes the issue raised in this ticket, but it
continues to allow queries that I think we should reject, like `WHERE text_col
< 5`, `WHERE text_col < int_col`, and `WHERE text_col IN (5, 10)`.
Approach (2) removes string <-> numeric coercions in comparison contexts
entirely. Instead, we arrange for the type coercion analysis pass to check for
situations where a numeric column is compared with a string literal, and then
insert a coercion for that literal as a special case. This is stricter than (1)
but I think it rejects more classes of dubious queries. I initially thought
that treating literals differently for type coercion seemed wrong, but on
reflection I think it's okay: we are basically approximating the idea that
string literals might not really be "string literals", they are more properly
treated as "unknown literals" and then typed based on context. This is
explicitly what Postgres does; we could consider a similar approach but IMO
that's too large of a project for right now.
I had Claude Code compare the two approaches:
```
┌───────────────────────────────┬───────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────┐
│ SQL Construct │ PR #20426
(type-based) │ PR #20456
(literal-aware) │
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ int_col < '5' │ Numeric comparison (runtime CAST)
│ Numeric comparison (planning-time literal
conversion) │
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ text_col < 5 │ Inserts CAST(text_col AS Int64) —
runtime error if data isn't numeric │ Planning-time error (type mismatch)
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ text_col = int_col │ Inserts CAST(text_col AS Int64) —
runtime error if data isn't numeric │ Planning-time error (type mismatch)
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ int_col IN ('5', '10') │ Coerces to Utf8 (uses
type_union_coercion) — lexicographic comparison │ Converts literals to Int64 at
planning time — numeric comparison │
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ text_col IN (5, 10) │ Coerces to Utf8 — string comparison,
succeeds │ Planning-time error
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ int_col BETWEEN '5' AND '100' │ Numeric (runtime CAST)
│ Numeric (planning-time literal conversion)
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ text_col BETWEEN 1 AND 10 │ Runtime error (CASTs column to Int64)
│ Planning-time error
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ '5' BETWEEN 1 AND 10 │ true (runtime CAST)
│ true (planning-time literal conversion)
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ int_col < 'hello' │ Runtime error (CAST fails)
│ Planning-time error (same message)
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ int_col = '99.99' │ Runtime error
│ Planning-time error
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ UNION int + string │ Coerces to Utf8
│ Coerces to Utf8 (same)
│
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
│ CASE mixed types │ Coerces to Utf8
│ Coerces to Utf8 (same)
│
└───────────────────────────────┴───────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────┘
```
If folks have an opinion on which approach they think is better (or would
like to suggest an alternative), I'd love to hear it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]