In our case, these UDFs are quite expensive and are developed
iteratively, so being able to cache the two "sides" of the graph
independently would speed up the development cycle. Otherwise, if you
modify foo() here, you have to recompute bar and baz, even though
they're unchanged:

df.withColumn('a', foo('x')).withColumn('b', bar('x')).withColumn('c', baz('x'))
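
Roughly, the split we'd like to cache looks like this (a sketch only:
'id' is an assumed unique row key, and foo/bar/baz stand in for the
real UDFs):

left = df.withColumn('a', foo('x')).cache()
right = df.withColumn('b', bar('x')).withColumn('c', baz('x')).cache()
# Re-joining the two cached "sides" on the key is the diamond DAG from
# the earlier mail, and is what currently produces a hash/sort-merge join.
combined = left.join(right.select('id', 'b', 'c'), on='id')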

Additionally, a longer-term goal would be to persist/cache these
columns to disk so that a downstream user could later mix and match
several (tens) of these columns as their inputs without having to
explicitly compute them themselves.
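
Something along these lines (again a sketch; the paths and the 'id'
key are hypothetical):

df.select('id', foo('x').alias('a')).write.parquet('/shared/columns/a')
# ... later, a downstream user mixes and matches precomputed columns:
a = spark.read.parquet('/shared/columns/a')
b = spark.read.parquet('/shared/columns/b')
inputs = base.join(a, on='id').join(b, on='id')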

Cheers
Andrew

On Mon, May 17, 2021 at 1:10 PM Sean Owen <sro...@gmail.com> wrote:
>
> Why join here - just add two columns to the DataFrame directly?
>
> On Mon, May 17, 2021 at 1:04 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>> Anyone have ideas about the below Q?
>>
>> It seems to me that, given that "diamond" DAG, Spark could see that
>> the rows haven't been shuffled or filtered and do some type of
>> "zip join" to push them back together, but I've not been able to get
>> a plan that doesn't do a hash/sort-merge join.
>>
>> Cheers
>> Andrew
>>

