neilconway commented on issue #15161:
URL: https://github.com/apache/datafusion/issues/15161#issuecomment-3936788737

   Alright, I spent some more time on this; I have PRs open for two possible 
approaches:
   
   1. #20426 
   2. #20456 
   
   Approach (1) is what I was initially planning. We prefer string -> numeric 
coercions in comparison contexts, and numeric -> string coercions in contexts 
like UNION and CASE. That fixes the issue raised in this ticket, but it 
continues to allow queries that I think we should reject, like `WHERE text_col 
< 5`, `WHERE text_col < int_col`, and `WHERE text_col IN (5, 10)`.
   
   Approach (2) removes string <-> numeric coercions in comparison contexts 
entirely. Instead, we arrange for the type coercion analysis pass to check for 
situations where a numeric column is compared with a string literal, and then 
insert a coercion for that literal as a special case. This is stricter than (1) 
but I think it rejects more classes of dubious queries. I initially thought 
that treating literals differently for type coercion seemed wrong, but on 
reflection I think it's okay: we are basically approximating the idea that 
string literals might not really be "string literals", they are more properly 
treated as "unknown literals" and then typed based on context. This is 
explicitly what Postgres does; we could consider a similar approach but IMO 
that's too large of a project for right now.
   
   I had Claude Code compare the two approaches:
   
   ```
   
     
┌───────────────────────────────┬───────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────┐
     │         SQL Construct         │                        PR #20426 
(type-based)                         │                    PR #20456 
(literal-aware)                     │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ int_col < '5'                 │ Numeric comparison (runtime CAST)        
                             │ Numeric comparison (planning-time literal 
conversion)            │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ text_col < 5                  │ Inserts CAST(text_col AS Int64) — 
runtime error if data isn't numeric │ Planning-time error (type mismatch)       
                       │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ text_col = int_col            │ Inserts CAST(text_col AS Int64) — 
runtime error if data isn't numeric │ Planning-time error (type mismatch)       
                       │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ int_col IN ('5', '10')        │ Coerces to Utf8 (uses 
type_union_coercion) — lexicographic comparison │ Converts literals to Int64 at 
planning time — numeric comparison │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ text_col IN (5, 10)           │ Coerces to Utf8 — string comparison, 
succeeds                         │ Planning-time error                          
                    │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ int_col BETWEEN '5' AND '100' │ Numeric (runtime CAST)                   
                             │ Numeric (planning-time literal conversion)       
                │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ text_col BETWEEN 1 AND 10     │ Runtime error (CASTs column to Int64)    
                             │ Planning-time error                              
                │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ '5' BETWEEN 1 AND 10          │ true (runtime CAST)                      
                             │ true (planning-time literal conversion)          
                │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ int_col < 'hello'             │ Runtime error (CAST fails)               
                             │ Planning-time error (same message)               
                │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ int_col = '99.99'             │ Runtime error                            
                             │ Planning-time error                              
                │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ UNION int + string            │ Coerces to Utf8                          
                             │ Coerces to Utf8 (same)                           
                │
     
├───────────────────────────────┼───────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
     │ CASE mixed types              │ Coerces to Utf8                          
                             │ Coerces to Utf8 (same)                           
                │
     
└───────────────────────────────┴───────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────┘
   ```
   
   If folks have an opinion on which approach they think is better (or would 
like to suggest an alternative), I'd love to hear it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to