[D] How do you account memory in production-grade analytical engines based on DataFusion? Async Parquet reader makes it difficult [datafusion]

via GitHub Thu, 02 Oct 2025 09:16:47 -0700


GitHub user devozerov created a discussion: How do you account memory in 
production-grade analytical engines based on DataFusion? Async Parquet reader 
makes it difficult


DataFusion accounts for memory only in blocking operators, on the assumption 
that all other data structures carry a more or less fixed overhead.

We are building an analytical engine where the main data source is Parquet. 
Users may submit numerous pipelines, all reading multiple Parquet files 
asynchronously. Each such reader is an instance of the async Arrow Parquet 
reader, which is known to be memory-hungry because it reads whole row groups 
into RAM. Therefore, the typical DataFusion assumption of a small memory 
overhead for non-blocking operators doesn't hold. In fact, we observe a lot of 
page faults and crashes due to the aggressive Parquet reader behavior.
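
To make the gap concrete, here is a minimal sketch of the kind of accounting 
we have in mind for readers: each reader would reserve the size of a row group 
from a shared budget before fetching it, and release the bytes once the 
decoded batches have been handed downstream. This is only an illustration 
using the Rust standard library. `MemoryBudget`, `Reservation`, and 
`try_reserve` are hypothetical names, not DataFusion or arrow-rs APIs, and how 
a reader would learn the row group size up front is left as an assumption.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical process-wide byte budget shared by all Parquet readers.
/// Not a DataFusion or arrow-rs type.
#[derive(Debug)]
pub struct MemoryBudget {
    limit: usize,
    used: Mutex<usize>,
}

/// RAII guard for reserved bytes: returns them to the budget when dropped.
#[derive(Debug)]
pub struct Reservation {
    budget: Arc<MemoryBudget>,
    bytes: usize,
}

impl MemoryBudget {
    /// Create a budget that allows at most `limit` bytes to be reserved.
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self {
            limit,
            used: Mutex::new(0),
        })
    }

    /// Bytes currently reserved across all readers.
    pub fn used(&self) -> usize {
        *self.used.lock().unwrap()
    }

    /// Try to reserve `bytes` (for example, one row group's size, assumed to
    /// be known from the file metadata). Returns `None` when the reservation
    /// would exceed the limit, so the caller can wait or back off.
    pub fn try_reserve(self: &Arc<Self>, bytes: usize) -> Option<Reservation> {
        let mut used = self.used.lock().unwrap();
        let new_used = used.checked_add(bytes)?;
        if new_used > self.limit {
            return None;
        }
        *used = new_used;
        Some(Reservation {
            budget: Arc::clone(self),
            bytes,
        })
    }
}

impl Drop for Reservation {
    fn drop(&mut self) {
        // Give the bytes back so other readers can proceed.
        let mut used = self.budget.used.lock().unwrap();
        *used -= self.bytes;
    }
}
```

With something like this, a reader that cannot get a reservation would delay 
its next row group fetch rather than allocate past the limit. The open 
question is whether DataFusion offers, or plans to offer, an equivalent hook 
for non-blocking operators such as the Parquet scan.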

Are there any current best practices for dealing with this, or any plans for 
improvements?

GitHub link: https://github.com/apache/datafusion/discussions/17844
