That's a good point about skewness and potential join optimizations. I will
try turning off all skew optimizations and forcing a sort-merge join, then
see whether it re-uses the shuffle files on the static side.
Unfortunately my static side is too large to broadcast. The streaming side
can be broadcasted i
I suspect it is because the incoming rows, when joined with the static
frame, can show a varying degree of skew over time, in which case it is
probably better to choose the join strategy at run time. But if you
know your Dataset, I believe you can just do a broadcast join for your cas
I was surprised to find out that when a streaming dataframe is joined with a
static dataframe, the static dataframe is re-shuffled for every
microbatch, which adds considerable overhead.
Wouldn't it make more sense to re-use the shuffle files?
Or, if that is not possible, then load the static da