bharath-techie commented on issue #19216:
URL: https://github.com/apache/datafusion/issues/19216#issuecomment-3635177078

   So this is the error I get:
   ```
   datafusion-cli -m 8G
   DataFusion CLI v51.0.0
   > SET datafusion.execution.target_partitions=1;
   0 row(s) fetched. 
   Elapsed 0.001 seconds.
   
   > CREATE EXTERNAL TABLE hits 
   STORED AS PARQUET 
   LOCATION '/home/ec2-user/clickdata/partitioned/hits/*.parquet';
   0 row(s) fetched. 
   Elapsed 0.054 seconds.
   
   > SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 
10;
   Resources exhausted: Additional allocation failed for TopK[0] with top 
memory consumers (across reservations) as:
     TopK[0]#4(can spill: false) consumed 7.8 GB, peak 7.8 GB,
     GroupedHashAggregateStream[0] (count(1))#3(can spill: true) consumed 80.4 
KB, peak 4.6 GB,
     DataFusion-Cli#2(can spill: false) consumed 0.0 B, peak 0.0 B.
   Error: Failed to allocate additional 3.9 GB for TopK[0] with 7.8 GB already 
allocated for this reservation - 187.0 MB remain available for the total pool
   > 
   ```
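   To make sure I'm reading the error correctly: this is roughly how the memory pool behaves once a consumer that cannot spill hits the limit. A minimal standalone sketch of my own (not the actual `TopK` code; the pool type and sizes are made up for illustration):
   ```rust
   use std::sync::Arc;

   use datafusion::error::Result;
   use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

   fn main() -> Result<()> {
       // A small pool stands in for the `-m 8G` limit above
       let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(8 * 1024 * 1024));

       // Register a consumer, analogous to the `TopK[0]` reservation above
       let mut reservation = MemoryConsumer::new("TopK[0]").register(&pool);

       // The first allocation fits within the pool
       reservation.try_grow(6 * 1024 * 1024)?;

       // The second allocation exceeds the pool; since this consumer has no
       // way to spill, the only option is a "Resources exhausted" error
       let err = reservation
           .try_grow(4 * 1024 * 1024)
           .expect_err("allocation should exceed the pool");
       println!("{err}");

       Ok(())
   }
   ```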
   Once I added spill support to `TopK`, I was able to get results back with the same memory config:
   
   ```
   explain analyze SELECT "URL", COUNT(*) AS c FROM hits GROUP BY "URL" ORDER BY c DESC LIMIT 10;

   Plan with Metrics:
   SortExec: TopK(fetch=10), expr=[c@1 DESC], preserve_partitioning=[false], filter=[c@1 IS NULL OR c@1 > 57340], metrics=[output_rows=10, elapsed_compute=81.54ms, output_bytes=50.0 MB, output_batches=1, spill_count=19, spilled_bytes=115.1 MB, spilled_rows=125, row_replacements=189]
     ProjectionExec: expr=[URL@0 as URL, count(Int64(1))@1 as c], metrics=[output_rows=18.34 M, elapsed_compute=1.54ms, output_bytes=8.6 TB, output_batches=2.24 K]
       AggregateExec: mode=Single, gby=[URL@0 as URL], aggr=[count(Int64(1))], metrics=[output_rows=18.34 M, elapsed_compute=14.45s, output_bytes=8.6 TB, output_batches=2.24 K, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, peak_mem_used=4.90 B, aggregate_arguments_time=38.36ms, aggregation_time=389.52ms, emitting_time=1.30ms, time_calculating_group_ids=13.93s]
         DataSourceExec: file_groups={1 group: [[home/ec2-user/clickdata/partitioned/hits/hits_0.parquet:0..122446530, home/ec2-user/clickdata/partitioned/hits/hits_1.parquet:0..174965044, home/ec2-user/clickdata/partitioned/hits/hits_10.parquet:0..101513258, home/ec2-user/clickdata/partitioned/hits/hits_11.parquet:0..118419888, home/ec2-user/clickdata/partitioned/hits/hits_12.parquet:0..149514164, ...]]}, projection=[URL], file_type=parquet, metrics=[output_rows=100.00 M, elapsed_compute=1ns, output_bytes=20.0 GB, output_batches=12.31 K, files_ranges_pruned_statistics=100 total → 100 matched, row_groups_pruned_statistics=325 total → 325 matched, row_groups_pruned_bloom_filter=325 total → 325 matched, page_index_rows_pruned=0 total → 0 matched, batches_split=0, bytes_scanned=2.62 B, file_open_errors=0, file_scan_errors=0, num_predicate_creation_errors=0, predicate_cache_inner_records=0, predicate_cache_records=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, bloom_filter_eval_time=200ns, metadata_load_time=8.04ms, page_index_eval_time=200ns, row_pushdown_eval_time=200ns, statistics_eval_time=200ns, time_elapsed_opening=2.69ms, time_elapsed_processing=6.99s, time_elapsed_scanning_total=21.96s, time_elapsed_scanning_until_data=241.98ms, scan_efficiency_ratio=1600% (2.62 B/160.3 M)]
   ```
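   The change follows the usual spill-on-pressure pattern, roughly sketched below under my own assumptions (the file path, pool size, and schema are illustrative, and the real patch works on the TopK heap state rather than a single batch): when the reservation cannot grow, buffered batches are written out in Arrow IPC format and the memory is released, then the spilled data is merged back in at the end.
   ```rust
   use std::fs::File;
   use std::sync::Arc;

   use datafusion::arrow::array::Int64Array;
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::arrow::ipc::writer::FileWriter;
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::error::Result;
   use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

   fn main() -> Result<()> {
       let schema = Arc::new(Schema::new(vec![Field::new("c", DataType::Int64, false)]));
       let batch =
           RecordBatch::try_new(schema.clone(), vec![Arc::new(Int64Array::from(vec![1, 2, 3]))])?;

       // Tiny pool so the in-memory path fails and the spill path is taken
       let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(16));
       let mut reservation = MemoryConsumer::new("TopK[0]").register(&pool);

       // Try to account for the batch in memory; if the pool refuses,
       // spill the batch to disk instead of erroring out
       if reservation.try_grow(batch.get_array_memory_size()).is_err() {
           let file = File::create("/tmp/topk_spill.arrow")?;
           let mut writer = FileWriter::try_new(file, &schema)?;
           writer.write(&batch)?;
           writer.finish()?;
           // the spilled file is merged back with in-memory data later
       }

       Ok(())
   }
   ```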
   
   What I was asking is whether there is anything I can do in the planning configuration to make this query work without spilling, for example. I have all defaults. I tried reducing the batch size, since TopK holds 20 * batch_size rows before compacting, but it didn't seem to help.
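   For reference, the kind of tweak I mean is just a session-level setting in the CLI, e.g. lowering `datafusion.execution.batch_size` (the value below is only illustrative):
   ```
   > SET datafusion.execution.batch_size = 1024;
   ```
   But as noted above, reducing the batch size did not avoid the TopK allocation failure on its own.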

