suibianwanwank commented on issue #15524: URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2786137208
I'm very interested in this issue, so I try a simple POC. Below are my observations and recommendations. If my understanding is correct, a truly linear function should satisfy the identity: `f(3x + 2y + 1) = 3f(x) + 2f(y) + f(1)`, However, in practice, none of the basic SQL aggregate functions—`SUM`, `COUNT`, `MAX`, or `MIN`—exhibit this linearity. ## SUM Take the expression: `sum(3x + 2y + 1)`, we want to split it into `3sum(x) + 2sum(y) + sum(1)` This conversion is based on the assumption that both `x` and `y` are non-null: For any row, assuming x is null, the corresponding sum(3x + 2y + 1) should accumulate 0, but after splitting, sum(1) will accumulate 1 To make the transformation semantically equivalent, we need to add **null filtering** from the original expression. A more accurate rewrite would be: ``` 3 * sum(x) filter (y is not null) + 2 * sum(y) filter (x is not null) + sum(1) filter (x, y is not null) ``` Similarly, for Q29, `SUM("ResolutionWidth" + 1)` should be converted to `SUM(ResolutionWidth) + 1 * COUNT(ResolutionWidth)` rather than `COUNT(*)`. ## MAX/MIN We seem to be able to support only the most basic scenarios e.g.. `max(a + 1)=max(a) + 1` Consider the following scenario, the conversions should not be equivalent. `max(1 - a) != 1 - max(a)` `max(a + b) != max(a) + max(b)` ## COUNT `COUNT` has almost none of the characteristics of linear, e.g.. `count(x + 1) != count(x) + 1` The optimization I can think of is: if all fields in expr are null and expr always outputs null, then we can eliminate the computation of expr except for the fields. i.e. `count(x + 1) => count(x)` ## Conclusion In conclusion, unfortunately, I haven't found a universal approach to handle the envisioned "linear functions" as none fully meet the criteria of linear functions. Moreover, in my view, the optimization scenarios described above have very limited applicability. I'm uncertain whether such queries would occur in real-world scenarios. However, for Q29, the SUM optimization can provide significant performance improvements. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org