alamb commented on issue #13983: URL: https://github.com/apache/datafusion/issues/13983#issuecomment-2612064902
> Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that.

Running without filter pushdown (the default):

```sql
set datafusion.execution.parquet.pushdown_filters = false;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID")
FROM hits_partitioned
WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> ''
GROUP BY "SearchPhrase"
ORDER BY c DESC
LIMIT 10;
```

I get:

Elapsed 2.232 seconds.
Elapsed 2.252 seconds.
Elapsed 2.236 seconds.

When I enable filter pushdown, it runs about 12% faster (roughly 2.24 s vs. 1.97 s):

```sql
set datafusion.execution.parquet.pushdown_filters = true;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID")
FROM hits_partitioned
WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> ''
GROUP BY "SearchPhrase"
ORDER BY c DESC
LIMIT 10;
```

I get:

Elapsed 1.981 seconds.
Elapsed 1.953 seconds.
Elapsed 1.966 seconds.

Still not 5x though 🤔

Though it gives me new motivation to help @XiangpengHao get the pushdown improvements over the line in https://github.com/apache/arrow-rs/pull/6921
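
As a rough sanity check on the size of the gap, here is a minimal Python sketch that averages the three timings from each run and computes the relative change. The variable names are illustrative, and summarizing each run by its mean is an assumption, not necessarily how the figures above were compared.

```python
from statistics import mean

# Elapsed times in seconds, copied from the two runs reported above.
without_pushdown = [2.232, 2.252, 2.236]  # pushdown_filters = false
with_pushdown = [1.981, 1.953, 1.966]     # pushdown_filters = true

baseline = mean(without_pushdown)
pushdown = mean(with_pushdown)

# Fraction of elapsed time saved, and the equivalent speedup factor.
time_saved_pct = (baseline - pushdown) / baseline * 100
speedup = baseline / pushdown

print(f"mean without pushdown: {baseline:.3f} s")
print(f"mean with pushdown:    {pushdown:.3f} s")
print(f"elapsed time reduced by {time_saved_pct:.1f}%")
print(f"speedup factor: {speedup:.2f}x")
```

On these numbers the pushdown run takes about 12% less time, a speedup of roughly 1.14x, which is still far from 5x.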