Hi Sébastien,

I ran into a similar problem some time ago; you can see the discussion in the Spark users mailing list at http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results . My experience was that creating too many RDDs makes the Spark scheduler get stuck, so if the map you are building has many keys you will probably run into problems.

On the other hand, the last example I proposed in that thread was a batch job in which we start from a single RDD of time-tagged data, transform it into a list of RDDs corresponding to time windows derived from the time tag of each record, and then apply a transformation to each window RDD, for example KMeans.run from MLlib. This is very similar to what you propose.

So, in my humble opinion, the approach of generating thousands of RDDs by filtering doesn't scale, and a new RDD class should be implemented for this. I have never implemented a custom RDD, but if you want some help I would be happy to join you in this task.
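For what it's worth, a rough, untested sketch of that kind of batch job could look like the following (the names, the window computation and the KMeans parameters are placeholders, not our actual code):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Records are (timestampMillis, features); one KMeans model per time window.
def clusterByWindow(data: RDD[(Long, Array[Double])],
                    windowMillis: Long,
                    k: Int,
                    iterations: Int): Map[Long, KMeansModel] = {
  data.cache()
  // Distinct window ids, collected to the driver (assumes there are not too many).
  val windowIds = data.map { case (ts, _) => ts / windowMillis }.distinct().collect()
  // One filtered RDD per window -- this is exactly the "many RDDs" part that
  // made the scheduler struggle in my tests.
  windowIds.map { w =>
    val windowRdd = data
      .filter { case (ts, _) => ts / windowMillis == w }
      .map { case (_, features) => Vectors.dense(features) }
    w -> KMeans.train(windowRdd, k, iterations)
  }.toMap
}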
Greetings,
Juan

2015-04-29 12:56 GMT+02:00 Sébastien Soubré-Lanabère <s.sou...@gmail.com>:

> Hello,
>
> I'm facing a problem with custom RDD transformations.
>
> I would like to transform an RDD[K, V] into a Map[K, RDD[V]], that is, a map
> of RDDs by key.
>
> This would be great, for example, in order to run MLlib clustering on the V
> values grouped by K.
>
> I know I could do it by calling filter() on my RDD as many times as I have
> keys, but I'm afraid this would not be efficient (the entire RDD would be
> read each time, right?). Alternatively, I could mapPartitions my RDD before
> filtering, but the code ends up being huge...
>
> So, I tried to create a custom RDD to implement a splitByKey(rdd: RDD[K,
> V]): Map[K, RDD[V]] method, which would iterate over the RDD only once, but
> I cannot get my implementation to work.
>
> Please, could you tell me first whether this is really feasible, and then
> could you give me some pointers?
>
> Thank you,
> Regards,
> Sebastien
>
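P.S. Just to make sure we are talking about the same thing, the filter-based splitByKey you describe would be roughly this (untested sketch; the ClassTag bounds are there for the pair-RDD operations):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Naive splitByKey: one filter per key over a cached RDD.
// Fine for a handful of keys, but with thousands of keys it creates
// thousands of RDDs and jobs, which is what got the scheduler stuck for me.
def splitByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
  rdd.cache()
  val keys = rdd.keys.distinct().collect()
  keys.map { k =>
    k -> rdd.filter { case (key, _) => key == k }.values
  }.toMap
}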