Hi,

This is a question regarding SparkR in Spark 2.0. I have a SparkDataFrame that I want to partition by the values of one column: each distinct value corresponds to one partition, and all rows sharing the same column value go to the same partition, no more, no less.
The repartition() function doesn't seem to do this: I have 394 unique values, but it partitions my DataFrame into 200 partitions. Even if I set numPartitions to 394, the values and partitions still don't line up one-to-one. Is it possible to do what I described in SparkR? groupBy() doesn't work with a UDF at all. Alternatively, can I split the DataFrame into a list of smaller ones first, and if so, what should I use?

Thanks,
Neil
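
P.S. Here's a minimal sketch of what I'm running now; `key` stands in for my real column name, and the toy data is just for illustration:

  library(SparkR)
  sparkR.session()

  # Toy stand-in for my real data: 5 distinct keys, 20 rows.
  df <- createDataFrame(data.frame(key = rep(letters[1:5], 4), val = 1:20))

  # Repartitioning by the column hash-partitions on its values, but with
  # the default spark.sql.shuffle.partitions I get 200 partitions back:
  byCol <- repartition(df, col = df$key)

  # Asking for one partition per distinct value doesn't give a one-to-one
  # mapping either, since rows are assigned by hash of the value (this is
  # the "mismatch" I mentioned above):
  byColN <- repartition(df, numPartitions = 394L, col = df$key)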