Re: [PR] fix: bugs when having and group by are all false [datafusion]

via GitHub Sun, 11 Aug 2024 08:12:45 -0700


Lordworms commented on PR #11897:
URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2282792372


   > Thank you for your work on this PR @Lordworms but I am now pretty confused 
about these changes 🤔
   > 
   > In my understanding, the bug involves he `HAVING` clause -- the `HAVING` 
clause is like the `WHERE` clause except that it happens _after_ grouping where 
the `WHERE` clause happens before grouping.
   > 
   > This 
[page](https://www.shiksha.com/online-courses/articles/order-of-execution-in-sql/#:~:text=Since%20a%20lot%20of%20people,and%20finally%2C%20LIMIT%2FOFFSET.)
 has a nice table
   > 
   > <img alt="Screenshot 2024-08-11 at 7 17 33 AM" width="895" 
src="https://private-user-images.githubusercontent.com/490673/356869670-2a9cff40-cc70-44b3-854d-973bdfff865c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMzODg5ODksIm5iZiI6MTcyMzM4ODY4OSwicGF0aCI6Ii80OTA2NzMvMzU2ODY5NjcwLTJhOWNmZjQwLWNjNzAtNDRiMy04NTRkLTk3M2JkZmZmODY1Yy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwODExJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDgxMVQxNTA0NDlaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01NWE5OTliMzRhYzFkZDViMzZkM2Q4NTZjOGVlMjI1YjkxM2E3NTgyM2UzZjU5MjRmZDhmMGU2MTljZjQ2MWJmJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.2kJp1YaA_i6xt10Cq7YZkwQCxUOoSnaJpLuRcSEWCg0";>
   > So for a query like
   > 
   > ```sql
   > SELECT AVG(v1) FROM t1 GROUP BY false having false;
   > ```
   > 
   > I would expect a query plan that looks something like
   > 
   > ```
   > Filter(expr=false) <-- this is the HAVING(false)
   >   GroupBy(agg=AVG(v1), gby=false) <-- this is the GROUP BY expr
   >     TableScan(t1)
   > ```
   > 
   Yes, I understand, but the gby expr is optimized into an empty set in 
Optimizer, in order to pass the gby information to ExecutionPlan, I think I can 
only add an extra boolean member?
   > When I explained the query we can see that in fact this filter is added
   > 
   > ```
   > > create table t1(v1 int) as values (1), (2);
   > 0 row(s) fetched.
   > Elapsed 0.016 seconds.
   > 
   > > explain verbose SELECT AVG(v1) FROM t1 GROUP BY false having false;
   > ...
   > | initial_logical_plan                                       | Projection: 
avg(t1.v1)                                                                      
                              |
   > |                                                            |   Filter: 
Boolean(false)                                                                  
                                |
   > |                                                            |     
Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(t1.v1)]]                      
                                      |
   > |                                                            |       
TableScan: t1
   > ```
   > 
   > However, then it looks like the filter is pushed down below the group by
   > 
   > ```
   > | logical_plan after push_down_filter                        | Projection: 
avg(t1.v1)                                                                      
                              |
   > |                                                            |   
Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(CAST(t1.v1 AS Float64))]]     
                                        |
   > |                                                            |     Filter: 
Boolean(false)                                                                  
                              |
   > |                                                            |       
TableScan: t1
   > ```
   > 
   > Which finally results in
   > 
   > ```
   > | logical_plan after eliminate_filter                        | Aggregate: 
groupBy=[[]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                              
                               |
   > |                                                            |   
EmptyRelation
   > ```
   > 
   > So it seems a solution might be to refine the conditions under which 
filters can be pushed below grouping (perhaps we shouldn't push filters below 
grouping when there are no column references in the filter 🤔 )
   
   
   
   > dismissed their [stale review](#pullrequestreview-2230373615)
   
   Also in duckDB, those query do returns 0 rows.
   
![image](https://github.com/user-attachments/assets/c489eb54-eac6-411f-bc73-86bc67c151c7)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: bugs when having and group by are all false [datafusion]

Reply via email to