Hi,
also, building entire environments into containers may increase their size
massively.
Regards,
Gourav Sengupta
On Sat, Dec 4, 2021 at 7:52 AM Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:
> Hi Mich,
>
>
>
> sure, that's possible. But distributing the complete env would be more
> practical.
This is probably because your data size is well under the broadcast join
threshold, so at the planning phase it decides to do a BroadcastJoin instead
of a join which could take advantage of dynamic partition pruning. For testing
like this you can always disable that with
spark.sql.autoBroadcastJoinThreshold=-1.
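
For example, a minimal PySpark sketch (the session, table, and column names
here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-test").getOrCreate()

# Disable automatic broadcast joins; -1 turns the size threshold off,
# so the planner falls back to e.g. a sort-merge join for the test.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Hypothetical tables: a fact table partitioned by part_key, joined to a
# small dimension table with a selective filter.
fact = spark.table("fact_partitioned")
dim = spark.table("dim").where("region = 'EU'")
fact.join(dim, "part_key").explain()
# If DPP applies, the scan of fact_partitioned shows a
# dynamicpruningexpression in its PartitionFilters.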
Hi Meikel,
In the past I tried with
--py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
--archives
hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \
which is basically what you are doing. The first line (--py-files) works, but
the second (--archives) does not.
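
For reference, the usual pattern for shipping a venv this way (a sketch
following the Spark docs' Python package management approach; app.py and the
venv layout are assumptions, and the archive is typically built with
venv-pack) is to point the executors' Python at the directory Spark unpacks
from the archive:

export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./pyspark_venv/bin/python

spark-submit \
  --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
  --archives hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \
  app.py

The relative PYSPARK_PYTHON path resolves inside each executor's working
directory, where the #pyspark_venv suffix tells Spark to unpack the archive.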
Thank you for your response. It was a good point about being "under the
broadcast join threshold." We tested it on real data sets with tables of TB
scale, but instead Spark uses a sort-merge join without DPP. Anyway, you said
that DPP is not implemented for broadcast joins? So, I wonder how DPP can be
beneficial here.
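
One thing worth checking first (a sketch; DPP is controlled by
spark.sql.optimizer.dynamicPartitionPruning.enabled, which is on by default
in Spark 3.x):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Confirm DPP has not been switched off in this environment.
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))

# DPP only helps when the join key is a partition column of the large table
# and the other side carries a selective filter; explain() will then show a
# dynamicpruningexpression in the partitioned scan's PartitionFilters.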