suibianwanwank commented on issue #15524:
URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2786137208

   I'm very interested in this issue, so I try a simple POC. Below are my 
observations and recommendations.
   
   If my understanding is correct, a truly linear function should satisfy the 
identity:
   
   `f(3x + 2y + 1) = 3f(x) + 2f(y) + f(1)`,
   
   However, in practice, none of the basic SQL aggregate functions—`SUM`, 
`COUNT`, `MAX`, or `MIN`—exhibit this linearity. 
   
   
   ## SUM
   
   Take the expression:
   
   `sum(3x + 2y + 1)`,
   
   we want to split it into
   
   `3sum(x) + 2sum(y) + sum(1)`
   
   This conversion is based on the assumption that both `x` and `y` are 
non-null:
   
   For any row, assuming x is null, the corresponding sum(3x + 2y + 1) should 
accumulate 0, but after splitting, sum(1) will accumulate 1
   
   To make the transformation semantically equivalent, we need to add **null 
filtering** from the original expression. A more accurate rewrite would be:
   
   ```
   3 * sum(x) filter (y is not null) + 2 * sum(y) filter (x is not null) + 
sum(1) filter (x, y is not null)
   ```
   
   Similarly, for Q29, `SUM("ResolutionWidth" + 1)` should be converted to 
`SUM(ResolutionWidth) + 1 * COUNT(ResolutionWidth)` rather than `COUNT(*)`.
   
   
   
   
   ## MAX/MIN
   
   We seem to be able to support only the most basic scenarios e.g..
   
   `max(a + 1)=max(a) + 1`
   
   Consider the following scenario, the conversions should not be equivalent.
   
   `max(1 - a) != 1 - max(a)`
   
   `max(a + b) != max(a) + max(b)`
   
   ## COUNT
   `COUNT` has almost none of the characteristics of linear, e.g.. 
   `count(x + 1) != count(x) + 1` 
   The optimization I can think of is: if all fields in expr are null and expr 
always outputs null, then we can eliminate the computation of expr except for 
the fields. i.e.  
   `count(x + 1) => count(x)`
   
   ## Conclusion
   In conclusion, unfortunately, I haven't found a universal approach to handle 
the envisioned "linear functions" as none fully meet the criteria of linear 
functions.
   
   Moreover, in my view, the optimization scenarios described above have very 
limited applicability. I'm uncertain whether such queries would occur in 
real-world scenarios. However, for Q29, the SUM optimization can provide 
significant performance improvements.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to