Re: RDD split into multiple RDDs

2015-05-19 Thread Justin Uang
To do it in one pass, conceptually you would need to consume the entire parent iterator and store the values either in memory or on disk, which is generally something you want to avoid, given that the parent iterator's length is unbounded. If you need to start spilling to disk, you might…
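For concreteness, a minimal Scala sketch of the buffering idea Justin describes (the helper name splitByKeyInMemory is hypothetical, not a Spark API). It materializes every key's values on the driver, which is exactly the unbounded-memory risk being warned about:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def splitByKeyInMemory[K: ClassTag, V: ClassTag](
        sc: SparkContext, rdd: RDD[(K, V)]): Map[K, RDD[V]] =
      rdd.groupByKey()   // one pass over the parent data...
        .collect()       // ...but every value for every key now sits in driver memory
        .map { case (k, vs) => k -> sc.parallelize(vs.toSeq) }
        .toMap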

Re: RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hi Juan, Daniel, thank you for your explanations. Indeed, I don't have a large number of keys, at least not enough to get the scheduler stuck. I was using a method quite similar to the one you posted, Juan, and yes, it works, but I think it would be more efficient not to call filter on each key. So, I was…

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
Hi Daniel, I understood Sébastien to be talking about having a high number of keys; I guess I was prejudiced by my own problem! :) Anyway, I don't think you need to use disk or a database to generate an RDD per key: you can use filter, which I guess would be more efficient because IO is avoided, especially…
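A minimal sketch of that filter-per-key approach (the helper name splitByFilter is mine, not a Spark API). Caching the parent avoids recomputing its lineage for every key, though each child RDD still scans the cached parent once when evaluated:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def splitByFilter[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
      val cached = rdd.cache()                      // avoid recomputing the lineage per key
      val keys = cached.keys.distinct().collect()   // assumes the key set is small
      keys.map(k => k -> cached.filter(_._1 == k).values).toMap
    }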

Re: RDD split into multiple RDDs

2015-04-29 Thread Daniel Darabos
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method for saving the RDD into separate files by key in a single pass. Then you can read the files back into separate RDDs. On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com> wrote: > Hi Sébasti…
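The linked answer builds on Hadoop's MultipleTextOutputFormat to route each record to a file named after its key. Roughly, and assuming String keys and values (the output path and the helper splitViaFiles are illustrative, not from the answer):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Name each output file after the record's key, and drop the key
    // from the record itself.
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
    }

    def splitViaFiles(sc: SparkContext, rdd: RDD[(String, String)],
                      path: String): Map[String, RDD[String]] = {
      // Single pass over the data: each record lands in its key's file.
      rdd.saveAsHadoopFile(path, classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
      // Read each key's file back as its own RDD (collecting the distinct
      // keys is itself an extra pass unless they are already known).
      rdd.keys.distinct().collect()
        .map(k => k -> sc.textFile(s"$path/$k"))
        .toMap
    }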

Re: RDD split into multiple RDDs

2015-04-29 Thread Juan Rodríguez Hortalá
Hi Sébastien, I ran into a similar problem some time ago; you can see the discussion on the Spark users mailing list at http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results . My experience was that when you create too many RDDs, the Spark scheduler gets stuck…

RDD split into multiple RDDs

2015-04-29 Thread Sébastien Soubré-Lanabère
Hello, I'm facing a problem with custom RDD transformations. I would like to transform an RDD[K, V] into a Map[K, RDD[V]], that is, a map of RDDs by key. This would be great, for example, for running MLlib clustering on the V values grouped by K. I know I could do it using filter() on my RDD a…
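For illustration, assuming such a Map[K, RDD[V]] were available with MLlib Vector values, the per-key clustering Sébastien mentions could look like this sketch (the function and parameter names are mine):

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def clusterPerKey[K](byKey: Map[K, RDD[Vector]],
                         numClusters: Int, maxIterations: Int): Map[K, KMeansModel] =
      byKey.map { case (key, vectors) =>
        key -> KMeans.train(vectors, numClusters, maxIterations)  // one model per key
      }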