Thanks very much for the detailed explanations. I suspected this came down
to architectural support for the notion of an RDD of RDDs, but my
understanding of Spark, and of distributed computing in general, is not
deep enough to let me work this out on my own, so this really helps!
I ended up going with List[RDD].
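For reference, a minimal sketch of driving a List[RDD] from the driver side
(data and names are illustrative; assumes an existing SparkContext sc):

    import org.apache.spark.rdd.RDD

    // The list lives on the driver; each element is an independent RDD.
    val groups: List[RDD[Int]] = List(sc.parallelize(1 to 3), sc.parallelize(4 to 6))

    // A plain driver-side loop; each action runs as its own Spark job.
    val counts: List[Long] = groups.map(_.count())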
Yes, true. That's why I said "if and when."
But hopefully I have given a correct explanation of why an RDD of RDDs is
not possible.
On 09-Jun-2015 10:22 pm, "Mark Hamstra" wrote:
That would constitute a major change in Spark's architecture. It's not
happening anytime soon.
On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar wrote:
Possibly in the future, if and when the Spark architecture allows workers
to launch Spark jobs (from within the functions passed to the
transformation or action APIs of RDD), it will be possible to have an RDD
of RDDs.
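As things stand, this is exactly the pattern that fails. A minimal Scala
sketch of the attempt (names are illustrative; it compiles, but is expected
to fail at runtime when the nested action runs inside a task):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("nested-rdd-demo").setMaster("local[2]"))
    val outer = sc.parallelize(1 to 3)
    val inner = sc.parallelize(10 to 12)

    // The closure is serialized and shipped to executors, where `inner`
    // has no live SparkContext; the nested action typically fails with a
    // SparkException or NullPointerException instead of running a job.
    outer.map(x => inner.count() + x).collect()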
On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar wrote:
A similar question was asked before:
http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html
Here is one of the reasons why I think RDD[RDD[T]] is not possible:
- An RDD is only a handle to the actual data partitions. It has a
reference/pointer to the *SparkContext* object (*sc*) and a list of
partitions.
- The SparkContext is an object that exists only on the driver; it is not
available on the workers, so the inner RDDs' handles would be unusable
inside tasks.
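That driver-only nature is reflected in Spark's source: RDD keeps its
SparkContext in a @transient field, so the reference is dropped whenever
the handle is serialized into a task. A simplified sketch of the
declaration (the real one also takes a ClassTag and mixes in Logging):

    // Simplified from org.apache.spark.rdd.RDD: the @transient fields are
    // not serialized, so a deserialized RDD handle on a worker has no sc.
    abstract class RDD[T](
        @transient private var _sc: SparkContext,
        @transient private var deps: Seq[Dependency[_]]
      ) extends Serializable {
      // ...
    }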
On Wednesday, October 22, 2014 9:06 AM, Sean Owen wrote:
> No, there's no such thing as an RDD of RDDs in Spark.
> Here though, why not just operate on an RDD of Lists? or a List of RDDs?
> Usually one of these two is the right approach whenever you feel
> inclined to operate on an RDD of RDDs.
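For concreteness, a minimal Scala sketch of both alternatives (assuming an
existing SparkContext named sc):

    import org.apache.spark.rdd.RDD

    // An RDD of Lists: the nesting lives inside each record.
    val rddOfLists: RDD[List[Int]] = sc.parallelize(Seq(List(1, 2), List(3, 4, 5)))
    val flattened: RDD[Int] = rddOfLists.flatMap(identity)

    // A List of RDDs: the nesting lives on the driver.
    val listOfRdds: List[RDD[Int]] = List(sc.parallelize(1 to 3), sc.parallelize(4 to 6))
    val combined: RDD[Int] = listOfRdds.reduce(_ union _)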
Another approach could be to create an artificial key for each RDD and
convert them to PairRDDs. So your first RDD becomes a JavaPairRDD rdd1 with
values (1,"1"), (1,"2") and so on, and your second RDD becomes a
JavaPairRDD rdd2 with values (2,"a"), (2,"b"), (2,"c").
You can then union the two RDDs, groupByKey, countByKey, etc., and maybe
achieve what you want.
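A minimal Scala sketch of that keyed approach (assuming a SparkContext sc;
the keys 1 and 2 only mark which source RDD each value came from):

    import org.apache.spark.SparkContext._  // PairRDD operations (needed on older Spark)

    // Tag each value with an artificial key identifying its source RDD.
    val rdd1 = sc.parallelize(Seq("1", "2")).map(v => (1, v))
    val rdd2 = sc.parallelize(Seq("a", "b", "c")).map(v => (2, v))

    val unioned = rdd1.union(rdd2)
    val grouped = unioned.groupByKey()   // (1, ["1", "2"]), (2, ["a", "b", "c"])
    val counts  = unioned.countByKey()   // Map(1 -> 2, 2 -> 3)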
On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini wrote:
> Hello,
>