Hi Sébastien,

I ran into a similar problem some time ago; you can see the discussion in
the Spark users mailing list at
http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results
My experience was that when you create too many RDDs the Spark scheduler
gets stuck, so if you have many keys in the map you are creating you'll
probably run into problems. On the other hand, the latest example I proposed
in that thread was a batch job in which we start from a single RDD of
time-tagged data, transform it into a list of RDDs corresponding to time
windows built from the time tag of each record, and then apply a
transformation to each window RDD, for example KMeans.run from MLlib. This
is very similar to what you propose.
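
To make it concrete, here is a rough, untested sketch of that batch job. The
names, the feature type and the KMeans parameters are invented, it is only
meant to show the pattern:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Split one RDD of (timestamp, features) records into per-window RDDs by
// filtering, then run one KMeans per window.
def clusterByWindow(data: RDD[(Long, Array[Double])],
                    windowMs: Long): Map[Long, KMeansModel] = {
  // First pass over the data, just to find out which windows exist.
  val windows = data.map { case (ts, _) => ts / windowMs }.distinct().collect()

  windows.map { w =>
    // One filtered RDD per window: cheap to define, but every action on it
    // re-reads the parent RDD unless the parent is cached.
    val windowRdd = data
      .filter { case (ts, _) => ts / windowMs == w }
      .map { case (_, features) => Vectors.dense(features) }

    w -> new KMeans().setK(3).setMaxIterations(20).run(windowRdd)
  }.toMap
}

Defining the filtered RDDs is cheap, but each KMeans.run is a separate job
that scans the parent again unless you cache it, and the scheduler has to
track one lineage per window, which is where we saw it choke.
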
So in my humble opinion the approach of generating thousands of RDDs by
filtering doesn't scale, and a new RDD class should be implemented for this.
I have never implemented a custom RDD, but if you want some help I would be
happy to join you in this task.
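
For what it's worth, judging from the RDD API the surface to implement looks
small: essentially getPartitions and compute. Below is a rough, untested
skeleton (the class name is invented) of an RDD that exposes the values of a
single key of a keyed parent. By itself it still scans every parent partition
once per key, so it does not yet solve the "read the data only once" part,
which is where the real work would be:

import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Exposes the values of one key of a keyed parent RDD, reusing the parent's
// partitioning through a one-to-one dependency.
class SingleKeyRDD[K, V: ClassTag](parent: RDD[(K, V)], key: K)
  extends RDD[V](parent) {

  // Same partitions as the parent.
  override protected def getPartitions: Array[Partition] =
    firstParent[(K, V)].partitions

  // Read the corresponding parent partition and keep only this key's values.
  override def compute(split: Partition, context: TaskContext): Iterator[V] =
    firstParent[(K, V)].iterator(split, context)
      .collect { case (k, v) if k == key => v }
}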

Greetings,

Juan



2015-04-29 12:56 GMT+02:00 Sébastien Soubré-Lanabère <s.sou...@gmail.com>:

> Hello,
>
> I'm facing a problem with custom RDD transformations.
>
> I would like to transform an RDD[K, V] into a Map[K, RDD[V]], meaning a map
> of RDDs by key.
>
> This would be great, for example, in order to run MLlib clustering on the V
> values grouped by K.
>
> I know I could do it by calling filter() on my RDD as many times as I have
> keys, but I'm afraid this would not be efficient (the entire RDD would be
> read each time, right?). I could also mapPartitions over my RDD before
> filtering, but the code ends up being huge...
>
> So, I tried to create a custom RDD to implement a splitByKey(rdd: RDD[K,
> V]): Map[K, RDD[V]] method, which would iterate over the RDD only once, but
> I can't get the implementation to work.
>
> Please, could you tell me first whether this is really feasible, and then
> give me some pointers?
>
> Thank you,
> Regards,
> Sebastien
>
