Hi all,
Thank you for your responses.
After taking a look at the implementation of rdd.collect(), I thought of
using the SparkContext.runJob(...) method:
for (int i = 0; i < dataFrame.rdd().partitions().length; i++) {
    dataFrame.sqlContext().sparkContext().runJob(dataFrame.rdd(),
        /* some function over each partition's iterator */ ...);
}
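For reference, here is a minimal sketch of that idea in Scala (the RDD, the
row count n, and the app name are made up for illustration; it assumes a
Spark version with the runJob(rdd, func, partitions) overload, i.e. 1.5+,
older releases also take an allowLocal flag):

import org.apache.spark.{SparkConf, SparkContext}

object TakePerPartition {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("take-per-partition").setMaster("local[*]"))
    val rdd = sc.parallelize(100 to 120, 4)

    val n = 5                      // hypothetical: rows wanted on the driver
    var taken = Array.empty[Int]
    var p = 0
    // Run one job per partition, pulling only the rows still needed,
    // so the driver never holds more than n elements at once.
    while (taken.length < n && p < rdd.partitions.length) {
      val remaining = n - taken.length
      val chunk = sc.runJob(rdd, (it: Iterator[Int]) => it.take(remaining).toArray, Seq(p))
      taken ++= chunk.flatten
      p += 1
    }
    println(taken.mkString(", "))  // 100, 101, 102, 103, 104
    sc.stop()
  }
}

This is essentially how RDD.take is implemented internally: visit partitions
one job at a time and stop once enough rows have been gathered.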
Hi,
Maybe you could use zipWithIndex and filter to skip the first elements. For
example, starting from:
scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect
res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3),
(104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (111,11),
(112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18),
(119,19), (120,20))
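A minimal sketch of the filter step this leads into (k, the number of
leading elements to skip, is a made-up parameter):

// Skip the first k elements by their global index, then drop the index again.
val k = 5
val skipped = sc.parallelize(100 to 120, 4)
  .zipWithIndex
  .filter { case (_, idx) => idx >= k }
  .map { case (value, _) => value }

skipped.collect()   // Array(105, 106, ..., 120)

Note that zipWithIndex itself triggers a Spark job to compute per-partition
counts before the indexed RDD can be used.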
I think rdd.toLocalIterator is what you want. But note that it keeps one
partition's data in memory on the driver at a time.
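A minimal sketch of that approach, with hypothetical paging parameters
offset and n:

// toLocalIterator streams the RDD back to the driver one partition at a time,
// so at most one partition's data is held in driver memory.
val offset = 1000   // hypothetical: rows to skip
val n      = 100    // hypothetical: rows to keep
val page: Array[Int] = rdd.toLocalIterator.drop(offset).take(n).toArray

Each partition still launches a separate job, so this can be slow for RDDs
with many partitions, but the driver's memory footprint stays bounded by the
largest partition.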
On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera wrote:
> Hi all,
>
> I have a large set of data which would not fit into memory. So, I want
> to take n rows from the RDD, given a particular