Here are my questions. Thank you.

1. When joining DataFrames, is it a good idea to repartition on the key column used in the join, or is the optimizer smart enough that I don't need to bother?
2. With RDD joins, we do a reduceByKey before the join wherever possible to avoid shuffling a large amount of data. Do we need to do anything similar with DataFrame joins, or is the optimizer smart enough that I don't need to bother?
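
To make question 2 concrete, here is a rough plain-Python sketch of the pattern I mean. It is not Spark code: `reduce_by_key` and `inner_join` are made-up helpers that only mimic what reduceByKey and an RDD join do, and the data is invented for illustration.

```python
from collections import defaultdict


def reduce_by_key(pairs, combine):
    """Collapse (key, value) pairs to one value per key (mimics reduceByKey)."""
    reduced = {}
    for key, value in pairs:
        reduced[key] = combine(reduced[key], value) if key in reduced else value
    return list(reduced.items())


def inner_join(left, right):
    """Inner join of two lists of (key, value) pairs (mimics an RDD join)."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


# Hypothetical data: several order rows per customer, one name per customer.
orders = [("c1", 10), ("c1", 5), ("c2", 7), ("c1", 1), ("c2", 3)]
names = [("c1", "Ann"), ("c2", "Bob")]

# Reduce first, so only one record per key takes part in the join.
totals = reduce_by_key(orders, lambda a, b: a + b)
joined = inner_join(totals, names)
```

Here five order rows shrink to two before the join. In Spark, that is the data I would be saving from the shuffle.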