To do it in one pass, conceptually you would need to consume the entire
parent iterator and store the values either in memory or on disk, which is
generally something you want to avoid given that the parent iterator's
length is unbounded. If you need to start spilling to disk, you might
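For illustration, a minimal sketch of what that single-pass approach boils
down to (assuming Spark 1.3+, with String/Double only as placeholder key and
value types); it is exactly the materialize-everything pattern described
above, so it only works while the data fits in memory:

import org.apache.spark.rdd.RDD

// One pass over the data, but every value gets materialized: on the
// executors by groupByKey, then on the driver by collect.
def collectValuesByKey(rdd: RDD[(String, Double)]): Map[String, Seq[Double]] =
  rdd.groupByKey().collect().map { case (k, vs) => k -> vs.toSeq }.toMap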
Hi Juan, Daniel,
Thank you for your explanations. Indeed, I don't have a large number of
keys, at least not enough to get the scheduler stuck.
I was using a method quite similar to the one you posted, Juan, and yes, it
works, but I think it would be more efficient not to call filter on each
key. So, I was
Hi Daniel,
I understood Sébastien was talking about having a high number of keys; I
guess I was biased by my own problem! :) Anyway, I don't think you need to
use disk or a database to generate an RDD per key; you can use filter,
which I guess would be more efficient because IO is avoided, espec
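Roughly, the filter-per-key idea could look like the sketch below (untested;
each filter is an extra pass over the parent RDD, so caching the parent
first helps):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def splitByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
  // One job to discover the distinct keys, then one filtered RDD per key.
  val keys = rdd.map(_._1).distinct().collect()
  keys.map(k => k -> rdd.filter(_._1 == k).map(_._2)).toMap
}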
Check out http://stackoverflow.com/a/26051042/3318517. It's a nice method
for saving the RDD into separate files by key in a single pass. Then you
can read the files into separate RDDs.
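The approach in that answer boils down to something like the sketch below
(paraphrased from memory, not a verbatim copy of the linked code): subclass
MultipleTextOutputFormat so the key becomes the file name, then write
everything with a single saveAsHadoopFile.

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Routes each record to a file named after its key.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
}

// One pass: partition by key, then write. Each key ends up in its own file
// under outputDir, and those files can later be read back as separate RDDs.
def saveByKey(rdd: RDD[(String, String)], outputDir: String, numPartitions: Int): Unit =
  rdd.partitionBy(new HashPartitioner(numPartitions))
    .saveAsHadoopFile(outputDir, classOf[String], classOf[String], classOf[KeyBasedOutputFormat])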
On Wed, Apr 29, 2015 at 2:10 PM, Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com> wrote:
Hi Sébastien,
I ran into a similar problem some time ago; you can see the discussion on
the Spark users mailing list at
http://markmail.org/message/fudmem4yy63p62ar#query:+page:1+mid:qv4gw6czf6lb6hpq+state:results
. My experience was that when you create too many RDDs, the Spark scheduler
gets stu
Hello,
I'm facing a problem with custom RDD transformations.
I would like to transform an RDD[(K, V)] into a Map[K, RDD[V]], i.e. a map
of RDDs by key.
This would be great, for example, in order to run MLlib clustering on the V
values grouped by K.
I know I could do it using filter() on my RDD a
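To make the goal concrete: assuming something like the splitByKey sketch
earlier in this thread, the per-key clustering step could look roughly like
the following (the KMeans parameters are arbitrary placeholders, not
recommendations):

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Train one MLlib KMeans model per key.
def clusterPerKey[K](byKey: Map[K, RDD[Vector]],
                     numClusters: Int = 2,
                     maxIterations: Int = 20): Map[K, KMeansModel] =
  byKey.map { case (k, values) => k -> KMeans.train(values, numClusters, maxIterations) }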