tobixdev opened a new issue, #17488:
URL: https://github.com/apache/datafusion/issues/17488

   ### Describe the bug
   
   As part of trying to update [RDF 
Fusion](https://github.com/tobixdev/rdf-fusion) to DataFusion 50, we observed a 
significant performance regression for a query that makes use of a Nested Loop 
Join. 
   
   The [original 
comment](https://github.com/apache/datafusion/issues/16799#issuecomment-3270869325)
 was posted in the release issue for DataFusion 50. 
   
   I think the regression comes down to two points:
   - `build_row_join_batch` appears to call `ScalarValue::to_array_of_size` 
and create a `UnionArray`, which seems to be slow.
   - Much more time also seems to be spent evaluating expressions.
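   
   To illustrate the first point, here is a toy model in plain Rust. It is 
   only a sketch and uses no DataFusion or Arrow types: `Value`, `less_than`, 
   `join_with_broadcast`, and `join_without_broadcast` are hypothetical names 
   made up for this example. It assumes that the join broadcasts each left 
   value to the length of the right side before evaluating the filter, and 
   compares that with evaluating the filter directly against the right side.
   
   ```rust
   // Toy model only: these types and functions are hypothetical and are
   // NOT DataFusion or Arrow APIs.
   #[derive(Clone, Debug)]
   enum Value {
       Int(i64),
       Text(String),
   }
   
   // Stand-in for a filter such as coalesce(EBV(LT(a, b)), false):
   // values of different variants never match.
   fn less_than(a: &Value, b: &Value) -> bool {
       match (a, b) {
           (Value::Int(x), Value::Int(y)) => x < y,
           (Value::Text(x), Value::Text(y)) => x < y,
           _ => false,
       }
   }
   
   // Broadcast each left value to a full column, then compare element-wise.
   // Returns (number of matches, number of values materialized).
   fn join_with_broadcast(left: &[Value], right: &[Value]) -> (usize, usize) {
       let mut matches = 0;
       let mut materialized = 0;
       for l in left {
           // One freshly allocated column per left row.
           let broadcast: Vec<Value> = vec![l.clone(); right.len()];
           materialized += broadcast.len();
           matches += broadcast
               .iter()
               .zip(right)
               .filter(|(a, b)| less_than(a, b))
               .count();
       }
       (matches, materialized)
   }
   
   // Same result, but the left value is compared in place without copies.
   fn join_without_broadcast(left: &[Value], right: &[Value]) -> usize {
       left.iter()
           .map(|l| right.iter().filter(|r| less_than(l, r)).count())
           .sum()
   }
   
   fn main() {
       let left = vec![Value::Int(1), Value::Int(5), Value::Text("a".to_string())];
       let right = vec![
           Value::Int(3),
           Value::Int(7),
           Value::Text("b".to_string()),
           Value::Text("c".to_string()),
       ];
       let (matches, materialized) = join_with_broadcast(&left, &right);
       assert_eq!(matches, join_without_broadcast(&left, &right));
       println!("matches = {matches}, values materialized = {materialized}");
   }
   ```
   
   In this model the broadcast variant materializes `left.len() * right.len()` 
   values. The more expensive each value is to copy, the more that step costs, 
   which would be consistent with the time spent building `UnionArray`s in the 
   flamegraph.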
   
   This could be related to https://github.com/apache/datafusion/pull/16996
   
   @2010YOUY01 do you have a take on this?
   
   
   ### To Reproduce
   
   I don't have a reproducer with the DataFusion CLI. Below is the part of our 
execution plan that causes the problem. I know it's tough to read without 
knowing the system, and the filters are rather messy.
   
   If we cannot triage the regression with this information, I can try to come 
up with a custom program. I think we need two ingredients: Union values and 
complex filter expressions. 
   
   ```
   NestedLoopJoinExec: join_type=Inner, 
filter=coalesce(coalesce(EBV(LT(join_proj_push_down_6@0, 
join_proj_push_down_8@2)), false) AND coalesce(EBV(GT(join_proj_push_down_7@1, 
join_proj_push_down_9@3)), false), false), projection=[product@0, 
productLabel@1]
   ```
   
   Column Types:
   
   - `join_proj_push_down_6`: Large Union Type
   - `join_proj_push_down_8`: Large Union Type
   - `join_proj_push_down_7`: Large Union Type
   - `join_proj_push_down_9`: Large Union Type
   - `product`: `UInt32`
   - `productLabel`: `UInt32`
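   
   For readers unfamiliar with the filter, its null handling can be sketched 
   as follows. This is a simplified model, not our actual implementation: 
   `join_filter` is a hypothetical name, and the two parameters stand in for 
   the results of `EBV(LT(..))` and `EBV(GT(..))`, assumed here to be nullable 
   booleans.
   
   ```rust
   // Sketch only: `lt` and `gt` model the results of EBV(LT(..)) and
   // EBV(GT(..)), with None standing for NULL. `join_filter` is hypothetical.
   fn join_filter(lt: Option<bool>, gt: Option<bool>) -> bool {
       // Inner coalesce(x, false): NULL becomes false.
       let lhs = lt.unwrap_or(false);
       let rhs = gt.unwrap_or(false);
       // Both operands are non-null here, so the AND cannot yield NULL and
       // the outer coalesce(.., false) never has to substitute anything.
       Some(lhs && rhs).unwrap_or(false)
   }
   
   fn main() {
       for lt in [None, Some(false), Some(true)] {
           for gt in [None, Some(false), Some(true)] {
               println!("{lt:?} AND {gt:?} -> {}", join_filter(lt, gt));
           }
       }
   }
   ```
   
   A row pair passes the filter only if both comparisons evaluate to true; a 
   NULL on either side is treated as false.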
   
   ### Expected behavior
   
   Similar performance to DataFusion 49
   
   ### Additional context
   
   Here is a flamegraph of the query sub plan on DataFusion 49 (Total Time: 4.2 
ms):
   
   <img width="1633" height="646" alt="Image" 
src="https://github.com/user-attachments/assets/9233d33b-9dbf-4663-b93c-e703aa2e1efe"
 />
   
   Here is a flamegraph of the query sub plan on DataFusion 50 (Total Time: 
190.2 ms):
   
   <img width="1633" height="646" alt="Image" 
src="https://github.com/user-attachments/assets/d485d6f6-68a8-4b6d-ad06-e6b24ae5910b"
 />
   
   There is also an interactive view on 
[CodSpeed](https://codspeed.io/tobixdev/rdf-fusion/branches/feature%2Fupdate-df-50?uri=bench%2Fbenches%2Fbsbm_explore.rs%3A%3Absbm_explore%3A%3Absbm_explore_10000_1_partition%3A%3ABSBM%2520Explore%252010000%2520%28target_partitions%3D1%29%2520-%2520Q5).
 You can switch between Base (DF 49) and Head (DF 50). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

