Re: [I] Potential memory issue when using COPY with PARTITIONED BY [datafusion]

via GitHub Tue, 09 Jul 2024 09:30:03 -0700


hveiga commented on issue #11042:
URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2218143802


   > BTW something we have seen in InfluxDB, especially for very compressible 
data, was that the arrow writer was consuming substantial memory.
   > 
   > Something that might be worth testing would be to set the parquet writer's 
options to set `data_page_row_limit` to something like 20,000
   > 
   > By default it is unlimited. We just changed the default upstream in 
arrow-rs [apache/arrow-rs#5957](https://github.com/apache/arrow-rs/pull/5957) 
but that is not yet released
   
   Thanks for the suggestion @alamb. I tested with a different value for 
`data_page_row_limit` but got the same result.
   
   In general I have been having a hard time debugging this, since there 
is no `heaptrack` for Mac and the build process for `heaptrack_gui` is also 
broken at the moment because I cannot install one of the required 
dependencies. I did try valgrind and the RustRover profiler, but could not 
find anything relevant.
   
   I'll wait for https://github.com/apache/datafusion/issues/11344 to land and 
test again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


