Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

via GitHub Wed, 16 Jul 2025 06:22:18 -0700


GitHub user alamb added a comment to the discussion: Best practices for 
memory-efficient deduplication of pre-sorted Parquet files


👋 

Given your description, I am surprised that this query is using a 
HashAggregateStream. The hash aggregate needs to buffer the entire dataset in 
RAM (or spill it), which is likely why it is running out of memory.

Given that the data is sorted by col_1 and col_2, I would expect this query to 
use the streaming aggregate operator (which should not need much memory at all).
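
To illustrate why the sort order matters, here is a rough Python sketch of the 
two strategies. It is only a conceptual model, not DataFusion's actual 
implementation; the function names and the tuple layout 
`(col_1, col_2, col_3, col_4)` are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def hash_first_value(rows):
    """Hash-style aggregation: works on unsorted input, but keeps one
    entry per distinct (col_1, col_2) group in memory until the end."""
    seen = {}
    for row in rows:
        # Remember only the first row seen for each group key.
        seen.setdefault((row[0], row[1]), row)
    # The ORDER BY then requires sorting the buffered groups.
    return [seen[key] for key in sorted(seen)]


def streaming_first_value(rows):
    """Streaming aggregation: assumes rows arrive sorted by (col_1, col_2),
    so each group is contiguous and can be emitted as soon as it ends."""
    for _key, group in groupby(rows, key=itemgetter(0, 1)):
        # Only the current group is held; emit its first row and move on.
        yield next(group)


# Hypothetical sample data, already sorted by (col_1, col_2).
rows = [
    (1, "a", 10, 100),
    (1, "a", 11, 101),
    (1, "b", 12, 102),
    (2, "a", 13, 103),
    (2, "a", 14, 104),
]

hash_result = hash_first_value(rows)
streaming_result = list(streaming_first_value(rows))
```

Both functions return the same deduplicated rows on sorted input, but the 
streaming version never holds more than the current group, which is why I 
would expect it to need very little memory here.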


What does the plan look like for this:

```sql
EXPLAIN SELECT 
        col_1,
        col_2,
        first_value(col_3) AS col_3,
        first_value(col_4) AS col_4
    FROM 
        example 
    GROUP BY 
        col_1, col_2
    ORDER BY 
        col_1, col_2
```

Do you get a different operator when you remove the first/last value 
aggregates?

```sql
EXPLAIN SELECT 
        col_1,
        col_2 -- NOTE remove the first_value / last_value aggregates
    FROM 
        example 
    GROUP BY 
        col_1, col_2
    ORDER BY 
        col_1, col_2
```


GitHub link: 
https://github.com/apache/datafusion/discussions/16776#discussioncomment-13777332

