thinkharderdev commented on issue #7845: URL: https://github.com/apache/datafusion/issues/7845#issuecomment-2379416730
> https://github.com/samuelcolvin/batson-perf

I think this is a general issue with low-selectivity filters pushed down to the parquet scan. The way row filtering works now, the column is decoded once to evaluate the filter and then decoded again to produce the output batches. If the decoding time is non-trivial (e.g. it requires zstd decompression of a lot of data) and the filter is not particularly selective, the redundant decoding can easily more than offset the cost of just materializing the whole column and filtering. When row filtering was initially implemented we discussed keeping the decoded data cached, but ended up deciding against it because it can potentially consume a lot of memory.
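A rough back-of-the-envelope model makes the tradeoff concrete. This is an illustrative sketch, not DataFusion code: the cost parameters (`decode_cost_per_row`, `filter_cost_per_row`, `selectivity`) are hypothetical, but the structure mirrors the description above, where pushdown decodes the column once for the predicate and again for the surviving rows, while materializing decodes once and filters in memory.

```rust
// Hypothetical cost model comparing the two strategies described above.
// All costs are in arbitrary "work units per row".

/// Pushdown: decode the column to evaluate the predicate, then decode the
/// surviving rows (`selectivity` = fraction of rows that pass) a second time
/// to build the output batches.
fn pushdown_cost(rows: f64, decode_cost_per_row: f64, selectivity: f64) -> f64 {
    let predicate_pass = rows * decode_cost_per_row;
    let output_pass = rows * selectivity * decode_cost_per_row;
    predicate_pass + output_pass
}

/// Materialize: decode the whole column once, then filter the in-memory values.
fn materialize_cost(rows: f64, decode_cost_per_row: f64, filter_cost_per_row: f64) -> f64 {
    rows * (decode_cost_per_row + filter_cost_per_row)
}

fn main() {
    let rows = 1_000_000.0;
    let decode = 10.0; // decoding dominates, e.g. zstd-compressed pages
    let filter = 1.0; // in-memory filtering is comparatively cheap

    // Low-selectivity filter: 90% of rows pass, so most of the column is
    // decoded twice and pushdown loses to materialize-then-filter.
    let pushdown_low_sel = pushdown_cost(rows, decode, 0.9);
    let materialize = materialize_cost(rows, decode, filter);
    assert!(pushdown_low_sel > materialize);

    // Highly selective filter: only 1% of rows pass, the second decode pass
    // is nearly free, and pushdown wins.
    let pushdown_high_sel = pushdown_cost(rows, decode, 0.01);
    assert!(pushdown_high_sel < materialize);

    println!(
        "pushdown (90% pass): {pushdown_low_sel}, materialize: {materialize}, \
         pushdown (1% pass): {pushdown_high_sel}"
    );
}
```

Under these (made-up) numbers, pushdown costs `rows * decode * (1 + selectivity)` versus `rows * (decode + filter)` for materializing, so the crossover depends almost entirely on selectivity once decoding dominates, which matches the behavior reported in this issue.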
