Re: [I] Expose to `GroupsAccumulator` whether all the groups are sorted [datafusion]

via GitHub Wed, 12 Mar 2025 08:31:29 -0700


asubiotto commented on issue #14991:
URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2718168220


   I'm playing around with some SQL queries and came across a situation where 
an ordered aggregation could be used but it is not, leading to very high memory 
usage. Concretely, I have the need to unnest a list and assign the same row id 
to all elements of this list and reconstruct it downstream with `array_agg`. 
Here is a simplified example with the associated explain plan:
   ```
   EXPLAIN WITH unnested AS (SELECT
       ROW_NUMBER() OVER () AS generated_id,
       unnest(array[value]) as ar
     FROM range(1,5)) SELECT array_agg(ar) FROM unnested group by generated_id;
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                                                
                                                                                
                                                                                
                                                                                
                                                           |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: array_agg(unnested.ar)                         
                                                                                
                                                                                
                                                                                
                                                                                
                                                           |
   |               |   Aggregate: groupBy=[[unnested.generated_id]], 
aggr=[[array_agg(unnested.ar)]]                                                 
                                                                                
                                                                                
                                                                                
                                                                      |
   |               |     SubqueryAlias: unnested                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                           |
   |               |       Projection: generated_id, 
__unnest_placeholder(make_array(tmp_table.value),depth=1) AS 
UNNEST(make_array(tmp_table.value)) AS ar                                       
                                                                                
                                                                                
                                                                                
                         |
   |               |         Unnest: 
lists[__unnest_placeholder(make_array(tmp_table.value))|depth=1] structs[]      
                                                                                
                                                                                
                                                                                
                                                                                
                      |
   |               |           Projection: row_number() ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING, row_number() ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING AS generated_id, make_array(tmp_table.value) 
AS __unnest_placeholder(make_array(tmp_table.value))                            
                                                                                
                                                                      |
   |               |             WindowAggr: windowExpr=[[row_number() ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]                           
                                                                                
                                                                                
                                                                                
                                                               |
   |               |               TableScan: tmp_table projection=[value]      
                                                                                
                                                                                
                                                                                
                                                                                
                                                           |
   | physical_plan | ProjectionExec: expr=[array_agg(unnested.ar)@1 as 
array_agg(unnested.ar)]                                                         
                                                                                
                                                                                
                                                                                
                                                                    |
   |               |   AggregateExec: mode=FinalPartitioned, 
gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)]             
                                                                                
                                                                                
                                                                                
                                                                              |
   |               |     CoalesceBatchesExec: target_batch_size=8192            
                                                                                
                                                                                
                                                                                
                                                                                
                                                           |
   |               |       RepartitionExec: partitioning=Hash([generated_id@0], 
12), input_partitions=12                                                        
                                                                                
                                                                                
                                                                                
                                                           |
   |               |         AggregateExec: mode=Partial, gby=[generated_id@0 
as generated_id], aggr=[array_agg(unnested.ar)]                                 
                                                                                
                                                                                
                                                                                
                                                             |
   |               |           ProjectionExec: expr=[generated_id@1 as 
generated_id, __unnest_placeholder(make_array(tmp_table.value),depth=1)@2 as 
ar]                                                                             
                                                                                
                                                                                
                                                                       |
   |               |             UnnestExec                                     
                                                                                
                                                                                
                                                                                
                                                                                
                                                           |
   |               |               ProjectionExec: expr=[row_number() ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as row_number() ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, row_number() ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as generated_id, 
make_array(value@0) as __unnest_placeholder(make_array(tmp_table.value))]       
                                                                                
       |
   |               |                 RepartitionExec: 
partitioning=RoundRobinBatch(12), input_partitions=1                            
                                                                                
                                                                                
                                                                                
                                                                                
     |
   |               |                   BoundedWindowAggExec: wdw=[row_number() 
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name: 
"row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", 
data_type: UInt64, nullable: false, dict_id: 0, dict_is_ordered: false, 
metadata: {} }), frame: WindowFrame { units: Rows, start_bound: 
Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false 
}], mode=[Sorted] |
   |               |                     LazyMemoryExec: partitions=1, 
batch_generators=[range: start=1, end=5, batch_size=8192]                       
                                                                                
                                                                                
                                                                                
                                                                    |
   |               |                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                           |
   
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 row(s) fetched.
   Elapsed 0.010 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Expose to `GroupsAccumulator` whether all the groups are sorted [datafusion]

Reply via email to