Hi,

This is a question regarding SparkR in Spark 2.0. I have a SparkDataFrame that I want to partition by the values of one column: each distinct value corresponds to one partition, and all rows sharing the same column value go to the same partition, no more, no less.
The repartition() function doesn't seem to do this: I have 394 unique values, but it partitions my DataFrame into 200 partitions. Even if I set numPartitions to 394, the values and partitions still don't line up one-to-one. Is it possible to do what I described in SparkR? groupBy() doesn't work with a UDF at all. Alternatively, can I split the DataFrame into a list of smaller ones first, and if so, what should I use?

Thanks,
Neil
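
P.S. Here's a minimal sketch of what I'm running now; `key` stands in for my real column name, and the toy data is just for illustration:

  library(SparkR)
  sparkR.session()

  # Toy stand-in for my real data: 5 distinct keys, 20 rows.
  df <- createDataFrame(data.frame(key = rep(letters[1:5], 4), val = 1:20))

  # Repartitioning by the column hash-partitions on its values, but with
  # the default spark.sql.shuffle.partitions I get 200 partitions back:
  byCol <- repartition(df, col = df$key)

  # Asking for one partition per distinct value doesn't give a one-to-one
  # mapping either, since rows are assigned by hash of the value (this is
  # the "mismatch" I mentioned above):
  byColN <- repartition(df, numPartitions = 394L, col = df$key)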