This may be a naive question about join / groupBy-agg. In the RDD days,
whenever I wanted to (a) do a groupBy-agg, I used reduceByKey (from
PairRDDFunctions) with an optional partitioning strategy, which is either a
number of partitions or a Partitioner; and (b) for join (also from
PairRDDFunctions) and its variants, I likewise had a way to provide the
number of partitions (sketch below).
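
For concreteness, here is a minimal sketch of that RDD-era pattern (the data
and the partition count of 8 are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-partitioning").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    // a. groupBy-agg via reduceByKey, with an explicit number of partitions
    val summed = pairs.reduceByKey(_ + _, numPartitions = 8)

    // ...or with an explicit Partitioner instead of a count
    val summed2 = pairs.reduceByKey(new HashPartitioner(8), _ + _)

    // b. join, again with an explicit number of partitions
    val joined = pairs.join(other, numPartitions = 8)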

In the DataFrame API, how do I specify the number of partitions for these
operations? I could call repartition() after the fact, but that would add
another stage to the job.
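
That is, in a sketch like the following (data made up), groupBy/agg take no
partition-count argument, so the shuffle falls back to the
spark.sql.shuffle.partitions setting:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder()
      .appName("df-agg").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // The shuffle behind this groupBy-agg uses spark.sql.shuffle.partitions
    // (default 200); there is no numPartitions parameter to pass here.
    val agg = df.groupBy("key").agg(sum("value"))

    // repartition() does change the partition count, but it is an extra
    // exchange on top of the aggregation's own shuffle -- another stage.
    val resized = agg.repartition(8)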

One workaround to increase the number of partitions / tasks during a join is
to set 'spark.sql.shuffle.partitions' to some desired number at spark-submit
time. I am trying to see if there is a way to provide this programmatically
for every step of a groupBy-agg / join, as sketched below.
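
The closest thing I can think of is resetting the session conf just before
each shuffle-producing step (a sketch; largeDf, otherDf, smallDf, the paths,
and the counts 400/50 are placeholders). The value in effect when a query is
actually planned and executed is the one that counts, so it can be changed
between steps:

    spark.conf.set("spark.sql.shuffle.partitions", 400)
    val bigJoined = largeDf.join(otherDf, "key")
    bigJoined.write.parquet("/tmp/big")    // this shuffle uses 400 partitions

    spark.conf.set("spark.sql.shuffle.partitions", 50)
    val smallAgg = smallDf.groupBy("key").count()
    smallAgg.write.parquet("/tmp/small")   // this shuffle uses 50 partitions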

The reason to do it programmatically is that, depending on the size of the
DataFrame, I can use a larger or smaller number of tasks to avoid an
OutOfMemoryError.
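
To make that size-dependent, a hypothetical helper along these lines might
work (withSizedShuffle is my own name, not a Spark API; it leans on
Catalyst's size estimate, which can be rough without table statistics, and
on the parameterless stats accessor available in newer Spark versions):

    import org.apache.spark.sql.DataFrame

    // Hypothetical: derive a shuffle-partition count from the estimated
    // input size, so each task handles roughly targetBytesPerTask.
    def withSizedShuffle[T](df: DataFrame,
                            targetBytesPerTask: Long = 128L * 1024 * 1024)
                           (op: DataFrame => T): T = {
      // Catalyst's estimate of the plan's output size, in bytes.
      val bytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
      val parts = math.max(1, (bytes / targetBytesPerTask).toInt)
      df.sparkSession.conf.set("spark.sql.shuffle.partitions", parts.toLong)
      op(df)
    }

    // Usage: val counts = withSizedShuffle(df)(_.groupBy("key").count())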


