Re: RDD split into multiple RDDs

2015-05-19 Thread Justin Uang
To do it in one pass, conceptually you would need to consume the entire parent iterator and store the values either in memory or on disk, which is generally something you want to avoid, given that the parent iterator's length is unbounded. If you need to start spilling to disk, you might…
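For concreteness, a minimal Scala sketch of the buffering idea Justin describes (the helper name splitByKeyInMemory is hypothetical, not a Spark API). It materializes every key's values on the driver, which is exactly the unbounded-memory risk being warned about:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def splitByKeyInMemory[K: ClassTag, V: ClassTag](
        sc: SparkContext, rdd: RDD[(K, V)]): Map[K, RDD[V]] =
      rdd.groupByKey()   // one pass over the parent data...
        .collect()       // ...but every value for every key now sits in driver memory
        .map { case (k, vs) => k -> sc.parallelize(vs.toSeq) }
        .toMap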

Re: RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hi Juan, Daniel, thank you for your explanations. Indeed, I don't have a large number of keys, at least not enough to get the scheduler stuck. I was using a method quite similar to the one you posted, Juan, and yes, it works, but I think it would be more efficient not to call filter on each key. So, I was…

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
Hi Daniel, I understood Sébastien to be talking about having a high number of keys; I guess I was prejudiced by my own problem! :) Anyway, I don't think you need to use disk or a database to generate an RDD per key: you can use filter, which I guess would be more efficient because IO is avoided, especially…
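A minimal sketch of that filter-per-key approach (the helper name splitByFilter is mine, not a Spark API). Caching the parent avoids recomputing its lineage for every key, though each child RDD still scans the cached parent once when evaluated:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def splitByFilter[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
      val cached = rdd.cache()                      // avoid recomputing the lineage per key
      val keys = cached.keys.distinct().collect()   // assumes the key set is small
      keys.map(k => k -> cached.filter(_._1 == k).values).toMap
    }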

Re: RDD split into multiple RDDs

2015-04-29 Thread Daniel Darabos
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method for saving the RDD into separate files by key in a single pass. Then you can read the files back into separate RDDs. On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com> wrote: > Hi Sébasti…
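The linked answer builds on Hadoop's MultipleTextOutputFormat to route each record to a file named after its key. Roughly, and assuming String keys and values (the output path and the helper splitViaFiles are illustrative, not from the answer):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Name each output file after the record's key, and drop the key
    // from the record itself.
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
    }

    def splitViaFiles(sc: SparkContext, rdd: RDD[(String, String)],
                      path: String): Map[String, RDD[String]] = {
      // Single pass over the data: each record lands in its key's file.
      rdd.saveAsHadoopFile(path, classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
      // Read each key's file back as its own RDD (collecting the distinct
      // keys is itself an extra pass unless they are already known).
      rdd.keys.distinct().collect()
        .map(k => k -> sc.textFile(s"$path/$k"))
        .toMap
    }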

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
Hi Sébastien, I ran into a similar problem some time ago; you can see the discussion on the Spark users mailing list at http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results . My experience was that when you create too many RDDs, the Spark scheduler gets stuck…

RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hello, I'm facing a problem with custom RDD transformations. I would like to transform an RDD[K, V] into a Map[K, RDD[V]], that is, a map of RDDs by key. This would be great, for example, for running MLlib clustering on the V values grouped by K. I know I could do it using filter() on my RDD a…
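For illustration, assuming such a Map[K, RDD[V]] were available with MLlib Vector values, the per-key clustering Sébastien mentions could look like this sketch (the function and parameter names are mine):

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def clusterPerKey[K](byKey: Map[K, RDD[Vector]],
                         numClusters: Int, maxIterations: Int): Map[K, KMeansModel] =
      byKey.map { case (key, vectors) =>
        key -> KMeans.train(vectors, numClusters, maxIterations)  // one model per key
      }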