asubiotto commented on issue #14991: URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2718168220
I'm playing around with some SQL queries and came across a situation where
an ordered aggregation could be used but it is not, leading to very high memory
usage. Concretely, I have the need to unnest a list and assign the same row id
to all elements of this list and reconstruct it downstream with `array_agg`.
Here is a simplified example with the associated explain plan:
```
EXPLAIN WITH unnested AS (SELECT
ROW_NUMBER() OVER () AS generated_id,
unnest(array[value]) as ar
FROM range(1,5)) SELECT array_agg(ar) FROM unnested group by generated_id;
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: array_agg(unnested.ar)
|
| | Aggregate: groupBy=[[unnested.generated_id]],
aggr=[[array_agg(unnested.ar)]]
|
| | SubqueryAlias: unnested
|
| | Projection: generated_id,
__unnest_placeholder(make_array(tmp_table.value),depth=1) AS
UNNEST(make_array(tmp_table.value)) AS ar
|
| | Unnest:
lists[__unnest_placeholder(make_array(tmp_table.value))|depth=1] structs[]
|
| | Projection: row_number() ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING, row_number() ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING AS generated_id, make_array(tmp_table.value)
AS __unnest_placeholder(make_array(tmp_table.value))
|
| | WindowAggr: windowExpr=[[row_number() ROWS
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
|
| | TableScan: tmp_table projection=[value]
|
| physical_plan | ProjectionExec: expr=[array_agg(unnested.ar)@1 as
array_agg(unnested.ar)]
|
| | AggregateExec: mode=FinalPartitioned,
gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)]
|
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([generated_id@0],
12), input_partitions=12
|
| | AggregateExec: mode=Partial, gby=[generated_id@0
as generated_id], aggr=[array_agg(unnested.ar)]
|
| | ProjectionExec: expr=[generated_id@1 as
generated_id, __unnest_placeholder(make_array(tmp_table.value),depth=1)@2 as
ar]
|
| | UnnestExec
|
| | ProjectionExec: expr=[row_number() ROWS
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as row_number() ROWS
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, row_number() ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as generated_id,
make_array(value@0) as __unnest_placeholder(make_array(tmp_table.value))]
|
| | RepartitionExec:
partitioning=RoundRobinBatch(12), input_partitions=1
|
| | BoundedWindowAggExec: wdw=[row_number()
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name:
"row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING",
data_type: UInt64, nullable: false, dict_id: 0, dict_is_ordered: false,
metadata: {} }), frame: WindowFrame { units: Rows, start_bound:
Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false
}], mode=[Sorted] |
| | LazyMemoryExec: partitions=1,
batch_generators=[range: start=1, end=5, batch_size=8192]
|
| |
|
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.010 seconds.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
