alamb commented on issue #13983:
URL: https://github.com/apache/datafusion/issues/13983#issuecomment-2611999149

   > I think Q8, Q16~18, and Q35 can be closer to `hyper` in 44.0; they are
   > improved in [#12996](https://github.com/apache/datafusion/pull/12996).
   > Q35 can be even faster once
   > [#13617](https://github.com/apache/datafusion/pull/13617) is merged
   > (unfortunately, it can only be released in 46.0 because of my recent long
   > delay...)
   >
   > But Q23 is unbelievably fast in hyper... I think we may need to profile
   > it and think about how we can improve it.
   
   I agree. In case anyone else wants to see it, hyper is reported as 5x
   faster than DataFusion and 6x faster than DuckDB:
   
   <img width="1096" alt="Image" src="https://github.com/user-attachments/assets/f52e7d8d-73b7-4805-93b3-12162467dd8a" />
   
   I think this is Q23
   
https://github.com/apache/datafusion/blob/11b7b5c215012231e5768fc5be3445c0254d0169/benchmarks/queries/clickbench/queries.sql#L22
   
   ```sql
   SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c,
          COUNT(DISTINCT "UserID")
   FROM hits
   WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%'
     AND "SearchPhrase" <> ''
   GROUP BY "SearchPhrase"
   ORDER BY c DESC LIMIT 10;
   ```
   
   Profiling it like this:
   ```shell
   $ datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), MIN(\"Title\"),
       COUNT(*) AS c, COUNT(DISTINCT \"UserID\")
     FROM hits_partitioned
     WHERE \"Title\" LIKE '%Google%' AND \"URL\" NOT LIKE '%.google.%'
       AND \"SearchPhrase\" <> ''
     GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"
   ```
   
   26% of the time goes to snappy decompression and 40% to UTF-8 validation:
   
   <img width="1728" alt="Image" src="https://github.com/user-attachments/assets/60abeb01-6dcd-47d5-9c0c-ca7b46c82007" />
   
   Here is the full 
[flamegraph.svg](https://github.com/user-attachments/assets/00438b49-5348-4c2f-94c4-d8867fbdc502)
   
   
   So by my calculations, the snappy decompression time alone in DataFusion
   (0.26 * 10.28s ≈ 2.7s) takes longer than the total time hyper reports
   (1.8s) 😕
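
   To make the arithmetic explicit, here is a small sketch that applies the
   two flamegraph percentages to the 10.28s total. The variable names are
   made up for illustration, and it assumes the flamegraph fractions map
   linearly onto wall-clock time, which is only approximately true for a
   multi-threaded run.

   ```shell
   # Numbers taken from the comment above; variable names are illustrative.
   total_s=10.28      # DataFusion time used in the calculation above
   snappy_frac=0.26   # share of samples in snappy decompression
   utf8_frac=0.40     # share of samples in utf8 validation
   hyper_s=1.8        # total time reported for hyper

   # Scale each fraction by the total (assumes a linear mapping to wall-clock time).
   snappy_s=$(awk -v t="$total_s" -v f="$snappy_frac" 'BEGIN { printf "%.2f", t * f }')
   utf8_s=$(awk -v t="$total_s" -v f="$utf8_frac" 'BEGIN { printf "%.2f", t * f }')

   echo "snappy: ${snappy_s}s, utf8 validation: ${utf8_s}s, hyper total: ${hyper_s}s"
   ```

   Under that assumption, either of the two costs on its own is larger than
   the whole hyper runtime.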


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

