Hi,
  This is a question about SparkR in Spark 2.0.

I have a SparkDataFrame that I want to partition by the values of one
column: each distinct value corresponds to exactly one partition, and all
rows sharing the same column value should go to the same partition, no
more, no less.

   The function repartition() doesn't seem to do this. I have 394 unique
values, but it partitions my DataFrame into 200 partitions. If I set
numPartitions to 394, the values still don't line up one-to-one with the
partitions.
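To make the problem concrete, here is a minimal sketch of what I'm trying; the column name "grp" is a placeholder for my actual column. My understanding is that repartition() hash-partitions on the column, so distinct values can collide in the same partition, which would explain the mismatch:

```r
library(SparkR)

# Sketch only -- "grp" stands in for my real partitioning column.
# Without numPartitions, repartition() falls back to the default of
# spark.sql.shuffle.partitions (200), so my 394 values get hashed
# into 200 partitions.
df2 <- repartition(df, col = df$grp)

# Even with numPartitions = 394, hash partitioning can still send two
# distinct values to the same partition and leave others empty.
df3 <- repartition(df, numPartitions = 394L, col = df$grp)
```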

Is it possible to do what I described in SparkR?
groupBy doesn't work with a UDF at all.

Alternatively, can we first split the DataFrame into a list of smaller
DataFrames, one per value? If so, which function should I use?

Thanks,
Neil
