Dandandan commented on PR #20160: URL: https://github.com/apache/datafusion/pull/20160#issuecomment-3905053370
> [#20160 (comment)](https://github.com/apache/datafusion/pull/20160#issuecomment-3902329306)
>
> This is the main improvement.

Ok - yes, I see some improvements here and there, but it is still largely regressing against main by ~30s (TPCDS runs in ~50s without and ~80s with filter pushdown).

See e.g. this run https://github.com/apache/datafusion/pull/20318#issuecomment-3902690761 against main without dynamic filter pushdown.

```
│ QQuery 64 │ 1194.15 ms │ 31181.42 ms │ 26.11x slower │
```

This ~26x regression (and many others) is still unchanged in this PR :(

```
│ QQuery 64 │ 28583.66 ms │ 28523.14 ms │ no change │
```

As we're running with

```
DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS=true
```

the main branch is also showing the regressions, so we're comparing two "slow" versions.

I think I now have an understanding of why the current approach's adaptiveness isn't helping _that much_ yet. As we're only checking the filters on `open`, they are only sorted / considered / discarded when the query consists of many files, i.e. more files than threads. In other cases, it will evaluate / scan the columns regardless of the tracking (as it will open the files directly at the start of the query / query phase, when the selectivity is not yet known).

I think for it to work effectively, it needs to integrate more with the parquet reader to remove or add a filter based on the adaptiveness _during_ the scan.
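To make the "more files than threads" point concrete, here is a rough toy model in Python. It is not DataFusion code: the function name, the wave-based scheduling, and the assumption that selectivity becomes known as soon as the first wave of files finishes are all hypothetical simplifications for illustration.

```python
def files_scanned_with_filter(n_files: int, n_threads: int, filter_is_useful: bool) -> int:
    """Count files that evaluate the pushed-down filter when the keep/drop
    decision is made only once per file, at open time.

    Hypothetical model: files are opened in waves of `n_threads`, and
    selectivity statistics only exist after the first wave has completed.
    """
    stats_known = False
    with_filter = 0
    remaining = n_files
    while remaining > 0:
        wave = min(remaining, n_threads)
        # Without statistics the filter cannot be discarded yet, so every
        # file opened in this wave evaluates it regardless of its usefulness.
        if not stats_known or filter_is_useful:
            with_filter += wave
        remaining -= wave
        # Assumption: one completed wave is enough to learn the selectivity.
        stats_known = True
    return with_filter


if __name__ == "__main__":
    # Fewer files than threads: every file is opened before anything is learned.
    print(files_scanned_with_filter(8, 16, filter_is_useful=False))
    # Many more files than threads: only the first wave pays for the useless filter.
    print(files_scanned_with_filter(64, 16, filter_is_useful=False))
```

In this model, a query with at most as many files as threads never benefits from the tracking, which is why deciding during the scan (rather than only at open) seems necessary.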
