2010YOUY01 opened a new issue, #12937:
URL: https://github.com/apache/datafusion/issues/12937

   ### Describe the bug
   
   Actual memory consumption remains high even if the memory limit is specified
   
   
   ### To Reproduce
   
   Compile and run datafusion-cli with:
   ```sh
   # /usr/bin/time to measure actual memory usage
   /usr/bin/time -l cargo run -- --mem-pool-type fair -m 1G -f 'aggr.sql'
   ```
   And it showed max memory resident set size is ~400M
   
   Next, set the memory limit to 200M and run it again, max RSS is still ~400M
   Use `explain analyze` can show that the spill actually happened
   And the query has ~1M groups, if changing it to 4 groups (see `aggr.sql`), 
max RSS is around 80M, the major memory consumer should be aggregation, so it's 
likely there is some issue with external aggregation implementation
   
   <details>
     <summary>aggr.sql</summary>
   
     Change file path to parquet lineitem file generated by 
https://github.com/apache/datafusion/tree/main/benchmarks
   ```sql
   -- aggr.sql
   CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
           l_orderkey BIGINT,
           l_partkey BIGINT,
           l_suppkey BIGINT,
           l_linenumber INTEGER,
           l_quantity DECIMAL(15, 2),
           l_extendedprice DECIMAL(15, 2),
           l_discount DECIMAL(15, 2),
           l_tax DECIMAL(15, 2),
           l_returnflag VARCHAR,
           l_linestatus VARCHAR,
           l_shipdate DATE,
           l_commitdate DATE,
           l_receiptdate DATE,
           l_shipinstruct VARCHAR,
           l_shipmode VARCHAR,
           l_comment VARCHAR,
   ) STORED AS parquet
   LOCATION 
'/Users/yongting/Desktop/code/my_datafusion/arrow-datafusion/benchmarks/data/tpch_sf1/lineitem';
   
   explain analyze
   SELECT
   count(*)
   FROM lineitem
   GROUP BY l_suppkey, l_partkey, l_returnflag, l_linestatus; -- cardinality is 
~36% of lineitem
   -- GROUP BY l_returnflag, l_linestatus -- 4 groups
   ```
   
   
   </details>
   
   
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to