alamb commented on issue #15177:
URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007

   I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec 
vs 12sec)
   
   ```sql
   > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY 
"EventTime" LIMIT 10;
   Elapsed 5.320 seconds.
   ```
   
   Here is what I think the rewrite is
   ```sql
   > SELECT * from 'hits_partitioned' WHERE "WatchID" IN (
     SELECT "WatchID" FROM 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER 
BY "EventTime" LIMIT 10
   );
   
   Elapsed 12.023 seconds.
   ```
   
   WatchID is a unique key
   
   ```sql
   > select count(distinct "WatchID"), count(*) from 'hits_partitioned';
   +------------------------------------------+----------+
   | count(DISTINCT hits_partitioned.WatchID) | count(*) |
   +------------------------------------------+----------+
   | 99997493                                 | 99997497 |
   +------------------------------------------+----------+
   ```
   
   I also double checked the output
   
   ```shell
   ## orig
   datafusion-cli -c "SELECT * FROM 'hits_partitioned' WHERE \"URL\" LIKE 
'%google%' ORDER BY \"EventTime\" LIMIT 10;" > orig.out
   
   ## rewrite
    datafusion-cli -c "SELECT * from 'hits_partitioned' WHERE \"WatchID\" IN 
(SELECT \"WatchID\" FROM 'hits_partitioned' WHERE \"URL\" LIKE '%google%' ORDER 
BY \"EventTime\" LIMIT 10);" > rewrite.out
   
   ## check
   sort orig.out > orig.out.sort
   sort rewrite.out > rewrite.out.sort
   diff orig.out.sort rewrite.out.sort
   
   7c7
   < Elapsed 5.649 seconds.
   ---
   > Elapsed 11.067 seconds.
   
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to