alamb opened a new issue, #15005: URL: https://github.com/apache/datafusion/issues/15005
### Is your feature request related to a problem or challenge?

## Introduction

This ticket is my weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or that you think should get wider attention from the community.

## Community Highlights

* DF 45 Blog post https://datafusion.apache.org/blog/2025/02/20/datafusion-45.0.0/
* @oznur-synnada updated the events page https://github.com/apache/datafusion/pull/14629
* We are hosting a [Google Summer of Code](https://github.com/apache/datafusion/issues/14577) -- thanks again @oznur-synnada for driving this

# Releases!

- [DataFusion 46](https://github.com/apache/datafusion/issues/14123) Release candidate is available. Huge thank you to @xudong963 for running this release. This one contains a [massive refactor of DataSource](https://github.com/apache/datafusion/pull/14224) from @ozankabak and @mertak-synnada
- Also a huge shout out to @blaginin for his help chasing down issues blocking the release: https://github.com/apache/datafusion/pull/14685
- Another huge shout out to @shehabgamin for his help testing and identifying issues pre-release
- Check out the [DataFusion 46 Upgrade Guide](https://github.com/apache/datafusion/pull/14891) to help with the upgrade

# Performance

DataFusion's core value proposition is great performance without having to re-implement it yourself

- @Omega359's improvement to https://github.com/apache/datafusion/pull/14653
- @berkaysynnada improved the sort tracking code more https://github.com/apache/datafusion/pull/14813
- @zjregee made `repeat` 50% faster: https://github.com/apache/datafusion/pull/14697
- @simonvandel made `to_hex` 2x faster: https://github.com/apache/datafusion/pull/14686
- @simonvandel also made `to_hex` 4x faster: https://github.com/apache/datafusion/pull/14675 (no string copies for the win!)
- And @simonvandel also updated `date_trunc` to be 2x faster: https://github.com/apache/datafusion/pull/14593
- @Kev1n8 made `substr` faster: https://github.com/apache/datafusion/pull/14498

# Quality

## Testing

## Bug Fixes

DataFusion is in the "we are finding all the corner case bugs now" phase of its life and people are now bashing them down

- @joroKr21's fix for grouping exprs https://github.com/apache/datafusion/pull/14888
- @anlinc helped fix https://github.com/apache/datafusion/pull/14860
- https://github.com/apache/datafusion/pull/14852 @rluvaton
- @xudong963 https://github.com/apache/datafusion/pull/14569

## Docs

## Build time

## Cleanups 🧹

- physical-optimizer into its own crate (finally!): thanks to @logan-keede @berkaysynnada and @buraksenn.
- [breaking](https://github.com/apache/datafusion/pull/14873) the datafusion core [crate](https://github.com/apache/datafusion/pull/14951) apart (finally!): thanks to @logan-keede and @AdamGS
- @onlyjackfrost @niebayes @irenjj @goldmedal and others [have](https://github.com/apache/datafusion/pull/14727) [been](https://github.com/apache/datafusion/pull/14725) [migrating](https://github.com/apache/datafusion/pull/14856) [all](https://github.com/apache/datafusion/pull/14690) our functions to use `invoke_args` etc
- @jayzhan211 has been [fixing up wildcard handling](https://github.com/apache/datafusion/pull/14689)

# Features

Features under way:

- Statistics work: https://github.com/apache/datafusion/pull/14699

## Better Out of Core Support

In general, DataFusion is getting better at handling datasets that are larger than can fit in memory.
- @davidhewitt's improvement here https://github.com/apache/datafusion/pull/14868
- @2010YOUY01's work to improve spilling for StringView https://github.com/apache/datafusion/pull/14823
- @zhuqi-lucas improved datafusion-cli: https://github.com/apache/datafusion/pull/14766
- @Kontinuation improved docs https://github.com/apache/datafusion/pull/14789 and implementation https://github.com/apache/datafusion/pull/14644 and testing https://github.com/apache/datafusion/pull/14642

## We can have nice things! (Explain plans)

- @irenjj took the first step towards https://github.com/apache/datafusion/pull/14677. I'll give you a teaser below. Come help with the follow-on work on https://github.com/apache/datafusion/issues/14914

```
> explain select * from t1 inner join t2 on t1.i=t2.i;
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| logical_plan  | Inner Join: t1.i = t2.i                                    |
|               |   TableScan: t1 projection=[i]                             |
|               |   TableScan: t2 projection=[i]                             |
| physical_plan | ┌───────────────────────────┐                              |
|               | │    CoalesceBatchesExec    │                              |
|               | └─────────────┬─────────────┘                              |
|               | ┌─────────────┴─────────────┐                              |
|               | │        HashJoinExec       ├──────────────┐               |
|               | └─────────────┬─────────────┘              │               |
|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      ││       DataSourceExec      │ |
|               | │    --------------------   ││    --------------------   │ |
|               | │    partition_sizes: [0]   ││       partitions: 1       │ |
|               | │       partitions: 1       ││    partition_sizes: [0]   │ |
|               | └───────────────────────────┘└───────────────────────────┘ |
|               |                                                            |
+---------------+------------------------------------------------------------+
2 row(s) fetched.
```

## Better Error Messages

@eliaperantoni is working with various contributors to make the error messages better.
This work is tracked in

- https://github.com/apache/datafusion/issues/14429
- https://github.com/apache/datafusion/pull/14439
- @onlyjackfrost https://github.com/apache/datafusion/pull/14849

## Misc

- @simonvandel added https://github.com/apache/datafusion/pull/14830
- @Lordworms made expression access nicer: https://github.com/apache/datafusion/pull/14712
- @rkrishn7 did `UNION ALL BY NAME` https://github.com/apache/datafusion/pull/14538

# Looking to get more involved? Please help review code! 📣

DataFusion has a long history of community members [contributing in all aspects of the project](https://datafusion.apache.org/contributor-guide/index.html). Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.

We have [docs about reviews](https://datafusion.apache.org/contributor-guide/index.html#reviewing-pull-requests). TL;DR: check for test coverage, whether the change is understandable and well documented, and whether the code can be improved. When you think the PR looks good to merge, try `@` mentioning [one of the committers](https://projects.apache.org/committee.html?datafusion).

## Help wanted

- I would love to see the community offer additional help with performance testing and triaging bugs, helping to make DataFusion a more stable foundation for building systems

Please feel free to leave your own comments on this ticket if you are looking for help

## Community

* [Weekly Call](https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit#heading=h.kpjkpncdmt1g)
* Slack/Discord: [info links](https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord)

## Upcoming meetups:

* Help schedule some!