hveiga commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2218143802
> BTW something we have seen in InfluxDB, especially for very compressible data, was that the arrow writer was consuming substantial memory.
>
> Something that might be worth testing would be to set the parquet writer's options to set `data_page_row_limit` to something like 20,000
>
> By default it is unlimited. We just changed the default upstream in arrow-rs [apache/arrow-rs#5957](https://github.com/apache/arrow-rs/pull/5957) but that is not yet released

Thanks for the suggestion @alamb. I tested with a different value for `data_page_row_limit` but got the same result. In general, I have been having a hard time debugging this, since there is no `heaptrack` for Mac and the build process for `heaptrack_gui` is also broken at the moment because I cannot install one of the required dependencies. I did try valgrind and the RustRover profiler, but could not find anything relevant.

I'll wait for https://github.com/apache/datafusion/issues/11344 to land and test again.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
