wiedld commented on issue #11042: URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234244223
> I wonder if this could be related to DataFusion overriding the data_page_row_limit setting in https://github.com/apache/datafusion/issues/11367 (that @wiedld is working on)

@alamb is bringing up the `data_page_row_limit` because, in our own work, we found that the dict_encoder used a lot more memory with the datafusion default `data_page_row_limit=usize::MAX`. Only once we set `data_page_row_limit=20k` was the memory issue fixed. As a result, we decided to [change the arrow-rs/parquet default to 20k](https://github.com/apache/arrow-rs/pull/5957).

### The current gotchas with using the defaults in `COPY TO`

So the arrow-rs/parquet writer now has a default `data_page_row_limit=20k`, which means we should see that default when we run datafusion `COPY TO`, right? Wrong.

The [PR description here](https://github.com/apache/datafusion/pull/11524#issue-2414442512) gives a good overview of how the datafusion session's options are treated in arrow-rs/parquet's ArrowWriter.

* In some cases, the datafusion defaults **_override_** the arrow-rs/parquet defaults.
  * e.g. the datafusion default `data_page_row_limit=usize::MAX` overwrites the arrow-rs/parquet `data_page_row_limit=20k`.
  * short term fix: do as [alamb suggests](https://github.com/apache/datafusion/issues/11042#issuecomment-2232995071) and configure the sql `COPY TO` for `data_page_row_limit=20k`.
* In other cases, the datafusion defaults **_get ignored_** in arrow-rs/parquet.
  * specifically, this happens when the datafusion default is `None`.
  * e.g. the datafusion default `dictionary_enabled=None` is ignored, and the actual default behavior is to turn dictionary encoding on (as @hveiga noticed above).
  * short term fix: explicitly set `dictionary_enabled=Some(false)`
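To make the two cases above concrete, here is a minimal, std-only sketch of the precedence rules as described in this comment. It is a toy model, not DataFusion or arrow-rs code: `SessionParquetOptions`, `EffectiveWriterSettings`, `resolve`, and `workaround_options` are hypothetical names, and the constants only mirror the defaults quoted above (20k rows per data page, dictionary encoding on).

```rust
// Toy model of the behavior described above. All names are hypothetical;
// this is NOT the actual DataFusion or arrow-rs/parquet implementation.

/// Writer-side defaults, as quoted in the comment above.
const WRITER_DEFAULT_DATA_PAGE_ROW_LIMIT: usize = 20_000;
const WRITER_DEFAULT_DICTIONARY_ENABLED: bool = true;

/// Session-side options (assumed shape): one plain value, one optional value.
struct SessionParquetOptions {
    data_page_row_limit: usize,
    dictionary_enabled: Option<bool>,
}

impl Default for SessionParquetOptions {
    fn default() -> Self {
        Self {
            // Always set, so it always reaches the writer.
            data_page_row_limit: usize::MAX,
            // Unset, so the writer falls back to its own default.
            dictionary_enabled: None,
        }
    }
}

#[derive(Debug, PartialEq)]
struct EffectiveWriterSettings {
    data_page_row_limit: usize,
    dictionary_enabled: bool,
}

/// Combine session options with writer defaults.
fn resolve(opts: &SessionParquetOptions) -> EffectiveWriterSettings {
    EffectiveWriterSettings {
        // Case 1: a concrete session value overrides the writer default.
        data_page_row_limit: opts.data_page_row_limit,
        // Case 2: `None` is ignored and the writer default wins.
        dictionary_enabled: opts
            .dictionary_enabled
            .unwrap_or(WRITER_DEFAULT_DICTIONARY_ENABLED),
    }
}

/// The two short term fixes from the list above, applied together.
fn workaround_options() -> SessionParquetOptions {
    SessionParquetOptions {
        data_page_row_limit: WRITER_DEFAULT_DATA_PAGE_ROW_LIMIT,
        dictionary_enabled: Some(false),
    }
}
```

In this model, `resolve(&SessionParquetOptions::default())` yields a row limit of `usize::MAX` with dictionary encoding on, which is the surprising combination described above, while `resolve(&workaround_options())` yields a 20k row limit with dictionary encoding off.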
