Re: [I] Perf: Dataframe with_column and with_column_renamed are slow [datafusion]

via GitHub Wed, 12 Feb 2025 11:58:37 -0800


blaginin commented on issue #14563:
URL: https://github.com/apache/datafusion/issues/14563#issuecomment-2654717456

I really like that idea, Bruce! I tried to break your branch, but everything
seems to work 🙂 I think the issue was that on every rename, we tried to
recursively normalize _every_ column for the query, which is very expensive.
You could also normalize only your newly added columns and leave the rest
untouched - if you avoid `normalize_col`, it'll be even faster.

I think we can use this issue to make several nice optimizations that will
complement each other:

- The simplest one: do not call expensive operations when a column is _already
normalized_. The most obvious example is
[this](https://github.com/apache/datafusion/compare/main...blaginin:datafusion:early-exit-on-normalization)
(gives a 30% improvement in your benchmarks; maybe I'll find other places as
well).
- Do not normalize columns (your PoC), which will boost benchmarks further.
But it will still require normalization in some cases (`(..., true)` in your
current
[PR](https://github.com/apache/datafusion/compare/main...Omega359:arrow-datafusion:with_column_updates#diff-997707d7dfcac94032b84a25bc0010c62209bf767e3abc6580a55a0a97c19de2R1727)).
- For those cases with normalization, we can make an improvement by reusing
the existing projection (my PoC from yesterday). I think it's overall a good
idea to keep the plan simple - we'll spend less time simplifying and executing
it later.
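
For illustration, the early-exit idea from the first bullet could look roughly
like this. It is a minimal sketch, not the code in either branch: `Column`,
`normalize`, and the `HashMap` schema are hypothetical stand-ins rather than
DataFusion's actual types, and the `lookups` counter exists only to make the
skipped work visible.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a column reference (not DataFusion's `Column`).
#[derive(Debug, Clone, PartialEq)]
struct Column {
    /// `Some(table)` once the column is qualified, i.e. already normalized.
    relation: Option<String>,
    name: String,
}

/// Hypothetical normalization: qualify a column by looking up its table.
/// `schema` maps an unqualified column name to the table that owns it;
/// `lookups` counts how often the (assumed expensive) lookup actually runs.
fn normalize(col: Column, schema: &HashMap<String, String>, lookups: &mut usize) -> Column {
    // Early exit: a qualified column needs no further work.
    if col.relation.is_some() {
        return col;
    }
    // Slow path: only unqualified columns pay for the schema lookup.
    *lookups += 1;
    let relation = schema.get(&col.name).cloned();
    Column { relation, name: col.name }
}
```

With this shape, repeatedly renaming or adding columns only pays the lookup
cost for columns that have not been qualified yet.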

What do you think?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
