Re: Conda Python Env in K8S

2021-12-04 Thread Gourav Sengupta
Hi,

Also, building entire environments into containers may increase their sizes massively.

Regards,
Gourav Sengupta

On Sat, Dec 4, 2021 at 7:52 AM Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:
> Hi Mich,
>
> Sure, that's possible. But distributing the complete env would be more pra

Re: [Spark CORE][Spark SQL][Advanced]: Why dynamic partition pruning optimization does not work in this scenario?

2021-12-04 Thread Russell Spitzer
This is probably because your data size is well under the broadcast-join threshold, so at the planning phase Spark decides to do a broadcast join instead of a join that could take advantage of dynamic partition pruning. For testing like this you can always disable that with spark.sql.autoBroadcastJo
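Russell's suggestion can be sketched as a spark-submit invocation; a minimal sketch, assuming the truncated config key above is spark.sql.autoBroadcastJoinThreshold (the application file name is a placeholder):

```shell
# Setting the broadcast threshold to -1 disables size-based broadcast
# joins entirely, forcing the planner toward a join strategy (e.g.
# sort-merge) where dynamic partition pruning can be observed in tests.
spark-submit \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  --conf spark.sql.optimizer.dynamicPartitionPruning.enabled=true \
  my_job.py   # placeholder application
```

The same setting can also be changed per session at runtime (e.g. via `SET spark.sql.autoBroadcastJoinThreshold=-1` in Spark SQL) rather than at submit time.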

Re: Conda Python Env in K8S

2021-12-04 Thread Mich Talebzadeh
Hi Meikel,

In the past I tried with

--py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
--archives hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \

which is basically what you are doing. The first line (--py-files) works, but the seco
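For reference, the --archives approach above usually also needs the executor Python pointed into the unpacked archive; a sketch following the documented packed-environment pattern (the HDFS paths are the ones from the message, the env name and app file are placeholders, and conda-pack produces a .tar.gz rather than the .zip shown above):

```shell
# Pack the Conda environment once on the client (requires conda-pack).
conda pack -n pyspark_venv -o pyspark_venv.tar.gz

# The '#pyspark_venv' suffix unpacks the archive under that directory
# on each executor; PYSPARK_PYTHON must point at the interpreter inside
# the unpacked archive, or the shipped env is never used.
export PYSPARK_DRIVER_PYTHON=python              # driver uses the client-side env
export PYSPARK_PYTHON=./pyspark_venv/bin/python  # executors use the archive
spark-submit \
  --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
  --archives hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz#pyspark_venv \
  app.py   # placeholder application
```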

Re: [Spark CORE][Spark SQL][Advanced]: Why dynamic partition pruning optimization does not work in this scenario?

2021-12-04 Thread Mohamadreza Rostami
Thank you for your response. "Under the broadcast-join threshold" was a good point. We tested it on real data sets with tables of TB size, but Spark instead uses a sort-merge join without DPP. Anyway, you said that DPP is not implemented for broadcast joins? So I wonder how DPP can be benefic
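One way to check whether DPP actually fires on such a query is to look for a dynamic-pruning subquery in the physical plan; a sketch with spark-sql (the fact/dimension table and column names are hypothetical):

```shell
# When DPP applies, EXPLAIN shows 'dynamicpruningexpression(...)' among
# the partition filters of the fact-table scan. Note that by default
# Spark only reuses a broadcast result as the pruning filter (see
# spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly), which
# is why DPP and broadcast joins interact in the first place.
spark-sql \
  --conf spark.sql.optimizer.dynamicPartitionPruning.enabled=true \
  -e "EXPLAIN FORMATTED
      SELECT f.*
      FROM sales f JOIN dates d ON f.date_id = d.date_id
      WHERE d.year = 2021"
```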