asubiotto commented on issue #14991: URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2718168220
I'm playing around with some SQL queries and came across a situation where an ordered aggregation could be used but it is not, leading to very high memory usage. Concretely, I have the need to unnest a list and assign the same row id to all elements of this list and reconstruct it downstream with `array_agg`. Here is a simplified example with the associated explain plan: ``` EXPLAIN WITH unnested AS (SELECT ROW_NUMBER() OVER () AS generated_id, unnest(array[value]) as ar FROM range(1,5)) SELECT array_agg(ar) FROM unnested group by generated_id; +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: array_agg(unnested.ar) | | | Aggregate: groupBy=[[unnested.generated_id]], aggr=[[array_agg(unnested.ar)]] | | | SubqueryAlias: unnested | | | Projection: generated_id, __unnest_placeholder(make_array(tmp_table.value),depth=1) AS UNNEST(make_array(tmp_table.value)) AS ar | | | Unnest: lists[__unnest_placeholder(make_array(tmp_table.value))|depth=1] structs[] | | | Projection: row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS generated_id, make_array(tmp_table.value) AS __unnest_placeholder(make_array(tmp_table.value)) | | | WindowAggr: windowExpr=[[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] | | | TableScan: tmp_table projection=[value] | | physical_plan | ProjectionExec: expr=[array_agg(unnested.ar)@1 as array_agg(unnested.ar)] | | | AggregateExec: mode=FinalPartitioned, gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)] | | | CoalesceBatchesExec: target_batch_size=8192 | | | RepartitionExec: partitioning=Hash([generated_id@0], 12), input_partitions=12 | | | AggregateExec: mode=Partial, gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)] | | | ProjectionExec: expr=[generated_id@1 as generated_id, __unnest_placeholder(make_array(tmp_table.value),depth=1)@2 as ar] | | | UnnestExec | | | ProjectionExec: expr=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as generated_id, make_array(value@0) as __unnest_placeholder(make_array(tmp_table.value))] | | | RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1 | | | BoundedWindowAggExec: wdw=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name: "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", data_type: UInt64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false }], mode=[Sorted] | | | LazyMemoryExec: partitions=1, batch_generators=[range: start=1, end=5, batch_size=8192] | | | | +---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 row(s) fetched. Elapsed 0.010 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org