alamb commented on issue #13983: URL: https://github.com/apache/datafusion/issues/13983#issuecomment-2611999149
> I think Q8, Q16~18, Q35 can be closer to `hyper` in 44.0, they are improved in [#12996](https://github.com/apache/datafusion/pull/12996). And Q35 can be even much faster when [#13617](https://github.com/apache/datafusion/pull/13617) is merged (unfortunately, it can just be released in 46.0 for my long delay recently...)
>
> But Q23 is unbelievably fast in hyper... I think we may need to profile and think how can we improve it.

I agree -- in case anyone else wants to see it, hyper is reported as 5x faster than DataFusion and 6x faster than DuckDB:

<img width="1096" alt="Image" src="https://github.com/user-attachments/assets/f52e7d8d-73b7-4805-93b3-12162467dd8a" />

I think this is Q23: https://github.com/apache/datafusion/blob/11b7b5c215012231e5768fc5be3445c0254d0169/benchmarks/queries/clickbench/queries.sql#L22

```sql
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID")
FROM hits
WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> ''
GROUP BY "SearchPhrase"
ORDER BY c DESC
LIMIT 10;
```

Profiling it like this:

```shell
$ datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), MIN(\"Title\"), COUNT(*) AS c, COUNT(DISTINCT \"UserID\") FROM hits_partitioned WHERE \"Title\" LIKE '%Google%' AND \"URL\" NOT LIKE '%.google.%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"
```

26% of the time goes to snappy decompression and 40% of the time goes to utf8 validation:

<img width="1728" alt="Image" src="https://github.com/user-attachments/assets/60abeb01-6dcd-47d5-9c0c-ca7b46c82007" />

Here is the full [flamegraph.svg](https://github.com/user-attachments/assets/00438b49-5348-4c2f-94c4-d8867fbdc502)

So by my calculations, the snappy decompression alone in DataFusion (0.26 * 10.28s ≈ 2.7s) takes longer than the hyper reported time of 1.8s 😕
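The back-of-the-envelope calculation above can be written out as a short script. This is only a sketch of the arithmetic: the variable names are mine, the 10.28s total and 1.8s hyper time are the figures quoted in this comment, and the fractions are the approximate shares read off the flamegraph, so the results are rough estimates rather than measurements.

```python
# Figures quoted in the comment; the fractions are approximate flamegraph shares.
datafusion_total_s = 10.28  # total DataFusion time used in the comment for Q23
hyper_total_s = 1.8         # hyper's reported time for Q23
snappy_fraction = 0.26      # share of time in snappy decompression
utf8_fraction = 0.40        # share of time in utf8 validation

# Convert the flamegraph shares into estimated seconds.
snappy_s = snappy_fraction * datafusion_total_s
utf8_s = utf8_fraction * datafusion_total_s
remaining_s = datafusion_total_s - snappy_s - utf8_s

print(f"snappy decompression: {snappy_s:.2f}s")
print(f"utf8 validation:      {utf8_s:.2f}s")
print(f"everything else:      {remaining_s:.2f}s")
print(f"snappy alone exceeds hyper's total: {snappy_s > hyper_total_s}")
```

Under these assumptions, snappy decompression comes to roughly 2.7s and utf8 validation to roughly 4.1s, so each of the two by itself is already above hyper's 1.8s.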