hiltontj opened a new issue, #10299:
URL: https://github.com/apache/datafusion/issues/10299

   ### Is your feature request related to a problem or challenge?
   
   When reading from parquet files, bloom filters are **_not_** enabled by 
default. It is not immediately obvious that they are not being used when 
performing queries, so there may be users out there who are not aware that 
bloom filters in their parquet files are being ignored.
   
   Part of the issue, however, is that the default behaviour looks to be shared 
between read and write operations.
   
   ### Describe the solution you'd like
   
   It would be ideal if bloom filters were enabled by default on **_read_**. We 
should be careful, however, as I do not think they should be enabled by default 
on **_write_**, where, depending on how they are configured, their inclusion 
can be expensive.
   
   ### Describe alternatives you've considered
   
   Currently, bloom filters can be enabled, but only explicitly. 
For example, with `datafusion-cli`, which uses the default configuration, one 
must enable the setting via the environment, e.g.,
   ```
   DATAFUSION_EXECUTION_PARQUET_BLOOM_FILTER_ENABLED=true datafusion-cli
   ```
   or by setting it explicitly, e.g.,
   ```sql
   SET datafusion.execution.parquet.bloom_filter_enabled=true;
   ```
   This may not work for everyone, however, since the same setting also 
appears to enable bloom filters on write, which may be unwanted.
   
   ### Additional context
   
   Bloom filters are disabled by default here: 
https://github.com/apache/datafusion/blob/37.1.0/datafusion/common/src/config.rs#L398-L399
   
   This setting is ultimately used to prune row groups on read here: 
https://github.com/apache/datafusion/blob/37.1.0/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L531-L545
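
To illustrate what is lost when the setting is off, here is a minimal Python sketch of bloom-filter-based row group pruning. It is a toy, not DataFusion's code: `ToyBloomFilter`, `prune_row_groups`, and the `use_bloom_filters` flag are hypothetical names, and the hashing scheme is an assumption (Parquet itself uses a split-block bloom filter).

```python
import hashlib


class ToyBloomFilter:
    """A tiny Bloom filter; NOT the split-block filter Parquet uses."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def prune_row_groups(filters, needle, use_bloom_filters):
    """Return indices of row groups that still need to be scanned."""
    if not use_bloom_filters:
        # Filters stored in the file are ignored: every row group is scanned.
        return list(range(len(filters)))
    return [i for i, f in enumerate(filters) if f.might_contain(needle)]


# Hypothetical data: three row groups, one filter per row group.
row_groups = [["a", "b"], ["c", "d"], ["e", "f"]]
filters = []
for group in row_groups:
    bloom = ToyBloomFilter()
    for value in group:
        bloom.add(value)
    filters.append(bloom)

scanned_without = prune_row_groups(filters, "c", use_bloom_filters=False)
scanned_with = prune_row_groups(filters, "c", use_bloom_filters=True)
```

With the flag off, all row groups are scanned even though the filters exist. With it on, the row group that really holds the value is always kept (a bloom filter has no false negatives), while the others are usually skipped.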
   
   It looks like this setting is also applied on write here: 
https://github.com/apache/datafusion/blob/37.1.0/datafusion/common/src/file_options/parquet_writer.rs#L68
   
   There is an existing SLT test that explicitly enables this setting when 
performing a query here: 
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/predicates.slt#L509-L547
   However, I do not see any tests that use this setting on write.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

