Memory-efficient successive calls to repartition()

2015-08-20 Thread abellet
Hello, For the needs of my application, I need to periodically "shuffle" the data across nodes/partitions of a reasonably large dataset. This is an expensive operation, but I only need to do it every now and then. However, it seems that I am doing something wrong, because as the iterations go the memo
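The message is truncated here, but growing memory use across iterations of lazily chained transformations is a classic symptom of an ever-growing lineage chain; one common remedy in Spark is periodic checkpointing. Below is a plain-Python toy model (not Spark code; `ToyRDD` and its methods are illustrative names only) of why the chain grows and how materializing it keeps the depth constant:

```python
import random

class ToyRDD:
    """Toy model of a lazily evaluated dataset: each transformation
    keeps a reference to its parent, so the lineage chain grows."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data, self._parent, self._fn = data, parent, fn

    def repartition(self):
        # Return a new ToyRDD referencing this one; nothing is computed yet.
        return ToyRDD(parent=self, fn=lambda xs: random.sample(xs, len(xs)))

    def collect(self):
        # Walk the whole parent chain to compute the result.
        if self._parent is None:
            return list(self._data)
        return self._fn(self._parent.collect())

    def lineage_depth(self):
        d, node = 0, self
        while node._parent is not None:
            d, node = d + 1, node._parent
        return d

    def checkpoint(self):
        # Materialize the data and drop the parent chain,
        # analogous in spirit to checkpointing in Spark.
        return ToyRDD(data=self.collect())

rdd = ToyRDD(data=range(10))
for _ in range(100):
    rdd = rdd.repartition()
print(rdd.lineage_depth())   # 100: the chain grows by one per iteration

cp = ToyRDD(data=range(10))
for _ in range(100):
    cp = cp.repartition().checkpoint()
print(cp.lineage_depth())    # 0: materializing truncates the lineage
```

In real Spark the analogous fix would be calling `checkpoint()` (and unpersisting old RDDs) every few iterations, though the exact cause of the original poster's memory growth cannot be confirmed from the truncated message.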

Re: Best way to randomly distribute elements

2015-06-19 Thread abellet
Thanks a lot for the suggestions! On 18/06/2015 15:02, Himanshu Mehra [via Apache Spark User List] wrote: > Hi Abellet > > You can try RDD.randomSplit(weights array), where the weights array is the > array of weights you want to put in the consecutive partitions, > for example RDD.randomSplit(A
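The quoted suggestion is cut off, but the semantics of `RDD.randomSplit` are that each element independently lands in split i with probability proportional to weights[i]. A plain-Python sketch of that behavior (the function name and structure are illustrative, not the Spark implementation):

```python
import random

def random_split(data, weights, seed=None):
    """Sketch of randomSplit semantics: each element independently
    falls into split i with probability weights[i] / sum(weights)."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative probability boundaries, e.g. [0.25, 0.5, 1.0] for (1, 1, 2).
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point rounding
    splits = [[] for _ in weights]
    for x in data:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                splits[i].append(x)
                break
    return splits

parts = random_split(range(1000), [1, 1, 2], seed=42)
# Roughly 250 / 250 / 500 elements; exact counts vary with the seed.
```

Note that, like Spark's `randomSplit`, the split sizes are only proportional to the weights in expectation, not exactly.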

Best way to randomly distribute elements

2015-06-18 Thread abellet
Hello, In the context of a machine learning algorithm, I need to be able to randomly distribute the elements of a large RDD across partitions (i.e., essentially assign each element to a random partition). How could I achieve this? I have tried to call repartition() with the current number of parti
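The message is truncated, but a standard way to assign each element to a random partition is to key every element with a random partition id and partition by that key (in Spark, something along the lines of `rdd.map(x => (rng.nextInt(n), x)).partitionBy(new HashPartitioner(n))` — an assumed sketch, not code from the thread). The core idea in plain Python:

```python
import random
from collections import defaultdict

def shuffle_to_random_partitions(elements, num_partitions, seed=None):
    """Assign each element to a uniformly random bucket, mimicking
    the key-by-random-partition-id-then-partitionBy pattern."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for x in elements:
        buckets[rng.randrange(num_partitions)].append(x)
    # Return the buckets in partition order (some may be empty).
    return [buckets[i] for i in range(num_partitions)]

parts = shuffle_to_random_partitions(range(100), 4, seed=0)
# 4 partitions whose union is the original 100 elements
```

Unlike `repartition()`, which uses a round-robin-style redistribution, this pattern gives every element an independent uniformly random destination.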

Random pairs / RDD order

2015-04-16 Thread abellet
Hi everyone, I have a large RDD and I am trying to create an RDD of a random sample of pairs of elements from this RDD. The elements composing a pair should come from the same partition for efficiency. The idea I've come up with is to take two random samples and then use zipPartitions to pair each
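The description breaks off, but the proposed approach — two random samples zipped partition-wise so each pair stays local — can be sketched in plain Python as follows (the function is illustrative; in Spark this would correspond to sampling twice and calling `zipPartitions` on the results):

```python
import random

def random_pairs_per_partition(partitions, pairs_per_partition, seed=None):
    """Draw two random samples from the SAME partition and zip them
    element-wise, so every pair is local to one partition."""
    rng = random.Random(seed)
    pairs = []
    for part in partitions:
        a = [rng.choice(part) for _ in range(pairs_per_partition)]
        b = [rng.choice(part) for _ in range(pairs_per_partition)]
        pairs.extend(zip(a, b))
    return pairs

parts = [[1, 2, 3], [10, 20, 30]]
result = random_pairs_per_partition(parts, 2, seed=1)
# 4 pairs in total; both members of each pair come from the same partition
```

Because the two samples are drawn independently, a pair may occasionally contain the same element twice; whether that matters depends on the application.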

Pairwise computations within partition

2015-04-09 Thread abellet
Hello everyone, I am a Spark novice facing a nontrivial problem to solve with Spark. I have an RDD consisting of many elements (say, 60K), where each element is a d-dimensional vector. I want to implement an iterative algorithm which does the following. At each iteration, I want to apply an o
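The message is cut off, but the thread title says the goal is pairwise computations within a partition, which in Spark is typically done with `mapPartitions`. A plain-Python sketch of what such a per-partition function might look like (the squared Euclidean distance is an assumed example operation, not one confirmed by the truncated message):

```python
from itertools import combinations

def pairwise_within_partition(partition):
    """Example of what a mapPartitions function could do: compute the
    squared Euclidean distance for every pair of vectors in one partition."""
    vectors = list(partition)  # materialize the partition's iterator
    out = []
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        d2 = sum((a - b) ** 2 for a, b in zip(u, v))
        out.append(((i, j), d2))
    return out

part = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
result = pairwise_within_partition(part)
# [((0, 1), 25.0), ((0, 2), 1.0), ((1, 2), 18.0)]
```

Keeping the pairwise work inside `mapPartitions` avoids a shuffle, but the cost is quadratic in the partition size, so partition sizes need to stay modest (e.g. 60K elements spread over many partitions).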