thinkharderdev commented on issue #7845:
URL: https://github.com/apache/datafusion/issues/7845#issuecomment-2379416730

   > https://github.com/samuelcolvin/batson-perf
   
   I think this is a general issue with low-selectivity filters pushed down 
into the parquet scan. The way row filtering currently works, the column is 
decoded once to evaluate the filter and then decoded again to produce the 
output batches. If the decoding time is non-trivial (e.g. it requires zstd 
decompression of a lot of data) and the filter is not particularly selective, 
the redundant decoding can easily more than offset the cost of simply 
materializing the whole column and filtering it afterwards.
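
   As a back-of-envelope illustration (a sketch only, not DataFusion's actual 
code), the trade-off can be modeled by noting that a pushed-down filter pays 
the column's decode cost roughly once for the predicate pass plus once more 
for the fraction of rows that survive, while materializing first pays it 
exactly once:

```rust
/// Relative decode work when the predicate is pushed into the scan:
/// one full pass to evaluate the filter, plus a second pass that
/// re-decodes only the surviving rows for the output batch.
/// `selectivity` is the fraction of rows passing the filter (0.0..=1.0).
fn pushdown_cost(decode_cost: f64, selectivity: f64) -> f64 {
    decode_cost * (1.0 + selectivity)
}

/// Relative decode work when the whole column is materialized once
/// and filtered in memory afterwards.
fn materialize_cost(decode_cost: f64) -> f64 {
    decode_cost
}

fn main() {
    let decode_cost = 1.0; // normalized cost of fully decoding the column once

    // Highly selective filter: few rows survive, the redundant decode is cheap.
    let selective = pushdown_cost(decode_cost, 0.01);
    // Barely selective filter: most rows survive, the column is decoded ~twice.
    let unselective = pushdown_cost(decode_cost, 0.95);

    println!("selective pushdown:   {selective:.2}");
    println!("unselective pushdown: {unselective:.2}");
    println!("materialize once:     {:.2}", materialize_cost(decode_cost));

    // When most rows pass, pushdown does nearly double the decode work,
    // which is the regression described above.
    assert!(unselective > 1.5 * materialize_cost(decode_cost));
    assert!(selective < 1.1 * materialize_cost(decode_cost));
}
```

   The crossover point depends on how expensive decoding is relative to the 
in-memory filter, but with costly decompression and a filter that keeps most 
rows, the second decode dominates.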
   
   When row filtering was initially implemented, we discussed caching the 
decoded data between the two passes but ultimately decided against it, because 
the cache can potentially consume a lot of memory.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
