wiedld commented on issue #11042:
URL: https://github.com/apache/datafusion/issues/11042#issuecomment-2234244223

   > I wonder if this could be related to DataFusion overriding the 
data_page_row_limit setting in 
https://github.com/apache/datafusion/issues/11367 (that @wiedld is working on)
   
   @alamb is bringing up the `data_page_row_limit` because in our own work we 
found that the dict_encoder used a lot more memory with the datafusion default 
`data_page_row_limit=usize::MAX`. Only once we set 
`data_page_row_limit=20k` was the memory issue fixed. As a result, we 
decided to [change the arrow-rs/parquet default to 
20k](https://github.com/apache/arrow-rs/pull/5957).
   
   ### The current gotchas with using the defaults in `COPY TO`
   
   The arrow-rs/parquet writer now has a default `data_page_row_limit=20k`, so 
we should see that default when we run datafusion's `COPY TO`, right?
   
   Wrong. The [PR description 
here](https://github.com/apache/datafusion/pull/11524#issue-2414442512) gives a 
good overview of how the datafusion session's options are treated in 
arrow-rs/parquet's ArrowWriter. 
   * In some cases, the datafusion defaults **_override_** the arrow-rs/parquet 
defaults.
       * e.g. the datafusion default `data_page_row_limit=usize::MAX` overrides 
the arrow-rs/parquet default `data_page_row_limit=20k`.
       * short term fix: do as [alamb 
suggests](https://github.com/apache/datafusion/issues/11042#issuecomment-2232995071)
 and set `data_page_row_limit=20k` in the options of the SQL `COPY TO`.
   * In other cases, the datafusion defaults **_get ignored_** in 
arrow-rs/parquet.
       * specifically, this happens when the datafusion default is `None`.
       * e.g. the datafusion default `dictionary_enabled=None` gets ignored, and 
the actual default behavior is to turn it on (as @hveiga noticed above).
       * short term fix: explicitly set `dictionary_enabled=Some(false)`
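
   The two cases above can be modelled in a few lines of plain Rust. This is 
only a sketch of the behavior described in this comment, not the actual 
datafusion or arrow-rs code: `SessionOptions`, `EffectiveProps`, `resolve` and 
the two constants are hypothetical names made up for the illustration. The 
values (20k rows, dictionary on) are the arrow-rs/parquet defaults mentioned 
above.

```rust
/// Writer-side defaults, as described in the comment above.
/// (Hypothetical constant names, used only for this illustration.)
const WRITER_DEFAULT_DATA_PAGE_ROW_LIMIT: usize = 20_000;
const WRITER_DEFAULT_DICTIONARY_ENABLED: bool = true;

/// Hypothetical stand-in for the datafusion session's parquet options.
pub struct SessionOptions {
    /// Always carries a value, so it is always passed to the writer.
    pub data_page_row_limit: usize,
    /// `None` means "not set", so the writer falls back to its own default.
    pub dictionary_enabled: Option<bool>,
}

impl Default for SessionOptions {
    fn default() -> Self {
        Self {
            data_page_row_limit: usize::MAX,
            dictionary_enabled: None,
        }
    }
}

/// What the writer ends up using (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct EffectiveProps {
    pub data_page_row_limit: usize,
    pub dictionary_enabled: bool,
}

/// Case 1: a non-optional session value always wins, so the writer's
///         own 20k default is never seen.
/// Case 2: a `None` session value is ignored, so the writer default applies.
pub fn resolve(opts: &SessionOptions) -> EffectiveProps {
    // Start from the writer defaults ...
    let mut props = EffectiveProps {
        data_page_row_limit: WRITER_DEFAULT_DATA_PAGE_ROW_LIMIT,
        dictionary_enabled: WRITER_DEFAULT_DICTIONARY_ENABLED,
    };
    // ... then apply whatever the session provides.
    props.data_page_row_limit = opts.data_page_row_limit;
    if let Some(enabled) = opts.dictionary_enabled {
        props.dictionary_enabled = enabled;
    }
    props
}
```

   In this model, `resolve(&SessionOptions::default())` ends up with 
`data_page_row_limit == usize::MAX` and `dictionary_enabled == true`, which is 
the surprising combination described above. Setting both fields explicitly 
(20k and `Some(false)`) corresponds to the two short term fixes.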
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

