blaginin commented on issue #14563: URL: https://github.com/apache/datafusion/issues/14563#issuecomment-2654717456
I really like that idea, Bruce! I tried to break your branch, but everything seems to work 🙂

I think the issue was that on every rename, we tried to recursively normalize _every_ column in the query, which is very expensive. You could also potentially normalize only your newly added columns and leave the rest untouched - if you avoid `normalize_col`, it'll be even faster.

I think we can use this issue to make several nice optimizations that will complement each other:

- The simplest one: do not call expensive operations when a column is _already normalized_. The most obvious example is [this](https://github.com/apache/datafusion/compare/main...blaginin:datafusion:early-exit-on-normalization) (gives a 30% increase in your benchmarks; maybe I'll find other places as well).
- Do not normalize columns (your PoC), which will boost benchmarks further. But it will still require normalization in some cases (`(..., true)` in your current [PR](https://github.com/apache/datafusion/compare/main...Omega359:arrow-datafusion:with_column_updates#diff-997707d7dfcac94032b84a25bc0010c62209bf767e3abc6580a55a0a97c19de2R1727)).
- For those cases with normalization, we can make an improvement by reusing the existing projection (my PoC from yesterday).

I think it's overall a good idea to keep the plan simple - we'll spend less time simplifying and executing it later.

What do you think?
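To make the "already normalized" early exit concrete, here is a rough, self-contained sketch. It does not use DataFusion's real types or the code from either branch: `Col`, `resolve_relation`, `normalize`, and the `lookups` counter are all made up for illustration, and the schema is just a list of `(table, column)` pairs.

```rust
// Hypothetical stand-in for a column reference; NOT DataFusion's `Column`.
#[derive(Debug, Clone, PartialEq)]
struct Col {
    relation: Option<String>, // table qualifier; None means "not yet normalized"
    name: String,
}

// Pretend-expensive schema search (assumption: this models the costly part).
// `lookups` counts how many times we pay for it.
fn resolve_relation(
    name: &str,
    schema: &[(&str, &str)], // (table, column) pairs
    lookups: &mut usize,
) -> Option<String> {
    *lookups += 1;
    schema
        .iter()
        .find(|(_, col)| *col == name)
        .map(|(table, _)| table.to_string())
}

// Normalize a column, skipping the expensive search when it is already qualified.
fn normalize(col: Col, schema: &[(&str, &str)], lookups: &mut usize) -> Col {
    // Early exit: an already-qualified column is returned untouched.
    if col.relation.is_some() {
        return col;
    }
    let relation = resolve_relation(&col.name, schema, lookups);
    Col { relation, ..col }
}
```

With this shape, normalizing the same column a second time returns it unchanged and never reaches the schema search, which is the effect the first bullet is after.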