GitHub user alamb added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files
👋 Given your description, I am surprised that this query is using a `HashAggregateStream` -- the hash aggregate needs to buffer the entire dataset in RAM / spill it, which is likely why it is running out of memory.

Given that the data is sorted by `col_1` and `col_2`, I would expect this query to use the streaming aggregate operator (which should not use much memory at all).

What does the plan look like for this:

```sql
EXPLAIN
SELECT col_1, col_2, first_value(col_3) AS col_3, first_value(col_4) AS col_4
FROM example
GROUP BY col_1, col_2
ORDER BY col_1, col_2
```

Do you get a different operator when you remove the first/last value aggregates?

```sql
EXPLAIN
SELECT col_1, col_2 -- NOTE: remove the first_value / last_value aggregates
FROM example
GROUP BY col_1, col_2
ORDER BY col_1, col_2
```

GitHub link: https://github.com/apache/datafusion/discussions/16776#discussioncomment-13777332
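As a side note: the planner can only pick the streaming group-by if it knows the Parquet files are already sorted. If the table was registered without declaring that order, DataFusion falls back to the hash aggregate. A minimal sketch of declaring the order when registering an external table (the column types and `LOCATION` path here are placeholders, not taken from the original report):

```sql
-- Sketch: declare the pre-existing sort order so the optimizer can use
-- a streaming aggregate instead of buffering groups in a hash table.
-- Column types and the location are assumptions; adjust to the real schema.
CREATE EXTERNAL TABLE example (
    col_1 BIGINT,
    col_2 BIGINT,
    col_3 VARCHAR,
    col_4 VARCHAR
)
STORED AS PARQUET
WITH ORDER (col_1 ASC, col_2 ASC)
LOCATION '/path/to/sorted/parquet/';
```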