GitHub user devozerov created a discussion: How do you account memory in production-grade analytical engines based on DataFusion? Async Parquet reader makes it difficult
DataFusion accounts for memory only in blocking operators, assuming that other data structures carry a more or less fixed overhead. We are building an analytical engine where the main data source is Parquet. Users may submit numerous pipelines, all reading multiple Parquet files asynchronously. Each such reader is an instance of the async Arrow Parquet reader, which is known to be memory-hungry because it reads whole row groups into RAM. Therefore, the typical DataFusion assumption of a small memory overhead for non-blocking operators doesn't hold. In fact, we observe a lot of page faults and crashes due to the aggressive Parquet reader behavior. Are there any current best practices for dealing with this, or plans for improvements?

GitHub link: https://github.com/apache/datafusion/discussions/17844
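
To make the concern concrete, here is a minimal sketch, assuming each reader buffers one full row group at a time. It shows how the unaccounted memory grows with the number of concurrent readers, and how a shared byte budget would cap admissions. `ByteBudget` and `unaccounted_peak_bytes` are hypothetical names used for illustration only, not DataFusion or Arrow APIs, and the reader count and row group size are made-up numbers.

```python
import threading


class ByteBudget:
    """Hypothetical byte budget shared by concurrent readers (not a DataFusion API)."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self._lock = threading.Lock()

    def try_reserve(self, nbytes: int) -> bool:
        """Reserve nbytes if they fit under the limit; return False otherwise."""
        with self._lock:
            if self.used_bytes + nbytes > self.limit_bytes:
                return False
            self.used_bytes += nbytes
            return True

    def release(self, nbytes: int) -> None:
        """Return previously reserved bytes to the budget."""
        with self._lock:
            self.used_bytes = max(0, self.used_bytes - nbytes)


def unaccounted_peak_bytes(concurrent_readers: int, row_group_bytes: int) -> int:
    """Worst case when every reader holds one full row group in memory at once."""
    return concurrent_readers * row_group_bytes


MIB = 1024 * 1024

# Illustrative numbers only: 64 concurrent readers, 128 MiB per buffered row group.
peak = unaccounted_peak_bytes(64, 128 * MIB)

# With a 1 GiB budget, only some of the 64 readers may buffer a row group at once;
# the rest would have to wait until another reader releases its reservation.
budget = ByteBudget(limit_bytes=1024 * MIB)
admitted = sum(budget.try_reserve(128 * MIB) for _ in range(64))
```

In this sketch a reader that fails `try_reserve` would have to wait or retry rather than start buffering, which is the kind of back-pressure the question is asking about. Whether such a gate can be wired into the async Parquet reader is exactly the open point.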
