In our case, these UDFs are quite expensive and are developed iteratively, so being able to cache the two "sides" of the graph independently would speed up the development cycle. Otherwise, if you modify foo() in the snippet below, you have to recompute bar and baz, even though they're unchanged:
df.withColumn('a', foo('x')).withColumn('b', bar('x')).withColumn('c', baz('x'))

Additionally, a longer-term goal would be to persist/cache these columns to disk so that a downstream user could later mix and match several (10s) of these columns as their inputs without having to explicitly compute them themselves. (A rough sketch of this pattern is at the end of this message.)

Cheers
Andrew

On Mon, May 17, 2021 at 1:10 PM Sean Owen <sro...@gmail.com> wrote:
>
> Why join here - just add two columns to the DataFrame directly?
>
> On Mon, May 17, 2021 at 1:04 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Anyone have ideas about the below Q?
>>
>> It seems to me that, given that "diamond" DAG, Spark could see that
>> the rows haven't been shuffled/filtered and do some type of "zip join"
>> to push them together, but I've not been able to get a plan that
>> doesn't do a hash/sort merge join.
>>
>> Cheers
>> Andrew
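
For concreteness, here's a minimal sketch of the persist-to-disk pattern described above. The toy UDFs, the 'rid' row-id column, and the /tmp output paths are placeholders rather than our actual code; the point is only to show each expensive column being computed and written independently, then joined back for downstream use:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the real, expensive UDFs.
@pandas_udf("double")
def foo(x: pd.Series) -> pd.Series:
    return x * 2.0

@pandas_udf("double")
def bar(x: pd.Series) -> pd.Series:
    return x + 1.0

# 'rid' is a stable row id used to stitch the persisted columns back together.
df = spark.range(100).withColumnRenamed("id", "rid").withColumn("x", F.rand())

# Each derived column is computed and persisted on its own, so editing foo()
# later does not invalidate the cached output of bar().
df.select("rid", foo("x").alias("a")).write.mode("overwrite").parquet("/tmp/col_a")
df.select("rid", bar("x").alias("b")).write.mode("overwrite").parquet("/tmp/col_b")

# A downstream user mixes and matches persisted columns by joining on 'rid'.
combined = (df
            .join(spark.read.parquet("/tmp/col_a"), "rid")
            .join(spark.read.parquet("/tmp/col_b"), "rid"))
combined.explain()

The explain() output on that last join is where I see the hash/sort-merge join rather than the zip-style join I was hoping for.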