alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007
I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec vs 12sec) ```sql > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10; Elapsed 5.320 seconds. ``` Here is what I think the rewrite is ```sql > SELECT * from 'hits_partitioned' WHERE "WatchID" IN ( SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10 ); Elapsed 12.023 seconds. ``` WatchID is a unique key ```sql > select count(distinct "WatchID"), count(*) from 'hits_partitioned'; +------------------------------------------+----------+ | count(DISTINCT hits_partitioned.WatchID) | count(*) | +------------------------------------------+----------+ | 99997493 | 99997497 | +------------------------------------------+----------+ ``` I also double checked the output ```shell ## orig datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out ## rewrite datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN (SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER BY \"EventTime\" LIMIT 10);" > rewrite.out ## check sort orig.out > orig.out.sort sort rewrite.out > rewrite.out.sort diff orig.out.sort rewrite.out.sort 7c7 < Elapsed 5.649 seconds. --- > Elapsed 11.067 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org