colinmarc opened a new issue, #17970:
URL: https://github.com/apache/datafusion/issues/17970
### Describe the bug
Hi!
We're trying to support an unnamed BI tool on top of datafusion, which is
fond of generating symmetric aggregates that look like this:
```sql
SELECT (
COALESCE(
(
SUM(DISTINCT
(CAST(FLOOR(COALESCE( o_totalprice , 0) * 1000000)
AS DECIMAL(65,0)))
+ _hex_to_bigint('x' || MD5( l_orderkey::varchar
))::DECIMAL(65,0)
* 18446744073709551616
+ _hex_to_bigint('x' || SUBSTR(MD5(
l_orderkey::varchar ) , 17))::DECIMAL(65,0)
)
- SUM(DISTINCT
_hex_to_bigint('x' || MD5( l_orderkey::varchar
))::DECIMAL(65,0)
* 18446744073709551616
+ _hex_to_bigint('x' || SUBSTR(MD5(
l_orderkey::varchar ), 17))::DECIMAL(65,0)
)
) / 1000000, 0) -- coalesce
) / COUNT(DISTINCT l_orderkey) as average_total
FROM orders left join lineitem on (l_orderkey = o_orderkey)
```
Here is some documentation from a totally different, 100% unrelated BI tool
that explains why the above SQL is so cursed:
https://cloud.google.com/looker/docs/best-practices/understanding-symmetric-aggregates
Roughly what it's doing is computing a really big sum twice, and subtracting
it from itself. The numbers it's summing are hashes of a primary key. To
generate big unique numbers, it's doing two `md5` hashes and bitshifting the
first to the left.
Unfortunately, the results from DF seem to be nondeterministic, and often
negative. The issue seems to be with floating point math, related to the
literal `18446744073709551616` (2^64). The BI tool seems to be expecting that
the literal will be typed as a `DECIMAL(65, 0)`, since the left hand side of
the multiplication is also `DECIMAL(65, 0)`, but DF is picking that up as a
float:
```
<snip> AS Decimal256(65, 0)) * Float64(18446744073709552000) <snip>
```
This is especially weird because the literal in the plan is a completely
different number. 😕
To aid in debugging, I've included a small reproducer here:
https://github.com/colinmarc/df-symmetric-aggregate
Note that if you change the literal in the generated SQL to
`18446744073709551615` (`2^64 - 1`), it gets picked up as a `UInt64`, which is
still odd, but works, produces the correct value, and is deterministic.
Thanks in advance!
### To Reproduce
```
git clone https://github.com/colinmarc/df-symmetric-aggregate
cd df-symmetric-aggregate
cargo test
```
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]