Hi,

As [ticket] describes, the default equality/hash_value implementation
for UDFs (scalar, aggregate, and window functions) is easy to overlook
and therefore error-prone.

The error-proneness is a risk, and how serious it is is naturally
subjective. As [fix-eq] showed, this risk has materialized many times
even within the DataFusion code base, likely leading to query failures
and incorrect results (common subexpression elimination can merge calls
to UDF instances that wrongly compare equal). My assumption is that
third-party UDF implementations are affected to a similar extent. I
became aware of this only after seeing some clearly bogus query results
in a project building on DataFusion.

There are two known ways to address this problem:

1. Fix the default implementation to be safe ([pr-bc]). This has the
downside of disabling common subexpression elimination for queries that
benefit from it today.

2. Require an explicit implementation, potentially made very easy with
a #[derive(...)] attribute ([derive]). This has the downside of being a
breaking API change, requiring implementors to add these derive lines.
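
To make the pitfall and option 2 concrete, here is a minimal,
self-contained Rust sketch. The `Udf` trait, its name-only default
`equals`/`hash_value`, and the `AddConst` types are hypothetical
stand-ins invented for illustration. They are not the DataFusion API,
and the actual traits and the proposal in [derive] differ in detail.

```rust
use std::any::Any;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a UDF trait (NOT the DataFusion API).
trait Udf: Any {
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
    fn eval(&self, x: i64) -> i64;

    // Assumed default for this sketch: compares by name only.
    // Implementors inherit it silently unless they override it.
    fn equals(&self, other: &dyn Udf) -> bool {
        self.name() == other.name()
    }

    // Assumed default for this sketch: hashes the name only.
    fn hash_value(&self) -> u64 {
        let mut h = DefaultHasher::new();
        self.name().hash(&mut h);
        h.finish()
    }
}

// A parameterized function that relies on the defaults.
struct AddConst {
    addend: i64,
}

impl Udf for AddConst {
    fn name(&self) -> &str { "add_const" }
    fn as_any(&self) -> &dyn Any { self }
    fn eval(&self, x: i64) -> i64 { x + self.addend }
}

// The same function with derived, field-aware equality and hashing.
#[derive(PartialEq, Eq, Hash)]
struct AddConstDerived {
    addend: i64,
}

impl Udf for AddConstDerived {
    fn name(&self) -> &str { "add_const" }
    fn as_any(&self) -> &dyn Any { self }
    fn eval(&self, x: i64) -> i64 { x + self.addend }

    // Delegate to the derived PartialEq; a different concrete type
    // never compares equal.
    fn equals(&self, other: &dyn Udf) -> bool {
        other
            .as_any()
            .downcast_ref::<Self>()
            .map_or(false, |o| self == o)
    }

    // Delegate to the derived Hash so it stays consistent with equals.
    fn hash_value(&self) -> u64 {
        let mut h = DefaultHasher::new();
        self.hash(&mut h);
        h.finish()
    }
}

fn main() {
    let (a, b) = (AddConst { addend: 1 }, AddConst { addend: 2 });
    // The defaults treat two different functions as the same one.
    println!("default equals: {}", a.equals(&b));

    let (c, d) = (AddConstDerived { addend: 1 }, AddConstDerived { addend: 2 });
    // The derived implementation tells them apart.
    println!("derived equals: {}", c.equals(&d));
}
```

In this sketch, an optimizer that deduplicates expressions via
`equals`/`hash_value` would treat `a` and `b` as interchangeable even
though they compute different results. The derived variant costs one
attribute line plus two small delegating methods.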

Please leave your thoughts in [ticket].

Best,
PF


[ticket] https://github.com/apache/datafusion/issues/16677
[fix-eq] https://github.com/apache/datafusion/pull/16781
[pr-bc] https://github.com/apache/datafusion/pull/16681
[derive] https://github.com/apache/datafusion/issues/16677#issuecomment-3092338265
