...is done with Spark.
But anyway, that is the very thing Spark is advertised for.
> From: so...@cloudera.com
> Date: Sat, 6 Dec 2014 06:39:10 -0600
> Subject: Re: Java RDD Union
> To: ronalday...@live.com
> CC: user@spark.apache.org
>
I guess a major problem with this is that you lose fault tolerance.
You have no way of recreating the local state of the mutable RDD if a
partition is lost.
Why would you need thousands of RDDs for k-means? It's a few per iteration.
An RDD is more bookkeeping than data structure itself. They don't ...
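To make that concrete, here is a minimal sketch (not code from this thread; points: RDD[Array[Double]], initialCenters, and numIterations are assumed to exist, and the helper functions are made-up names). Each k-means iteration derives only two new RDDs from points:

import org.apache.spark.rdd.RDD

// Hypothetical helpers for the sketch: squared distance, nearest center, mean.
def dist2(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def closest(p: Array[Double], centers: Array[Array[Double]]): Int =
  centers.indices.minBy(i => dist2(p, centers(i)))

def average(ps: Iterable[Array[Double]]): Array[Double] = {
  val n = ps.size
  ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
}

var centers: Array[Array[Double]] = initialCenters
for (_ <- 1 to numIterations) {
  val assigned = points.map(p => (closest(p, centers), p))  // new RDD #1 this iteration
  val updated  = assigned.groupByKey().mapValues(average)   // new RDD #2 this iteration
  centers = updated.collect().sortBy(_._1).map(_._2)
}

Only the assignment and update RDDs exist per iteration, and each is a cheap lineage record until an action like collect() runs.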
> From: so...@cloudera.com
> Date: Fri, 5 Dec 2014 14:58:37 -0600
> Subject: Re: Java RDD Union
> To: ronalday...@live.com; user@spark.apache.org
>
> foreach does not create a new RDD, nor does it modify an existing RDD.
> However, in practice, nothing stops you from fiddling ...
> ...re in fact not changing but their referents are, and somehow this will
> no longer work when clustering.
>
> Thanks for comments.
>
>> From: so...@cloudera.com
>> Date: Fri, 5 Dec 2014 14:22:38 -0600
>> Subject: Re: Java RDD Union
>> To: ronalday...@live.com
Hi Ron,
Out of curiosity, why do you think that union is modifying an existing RDD
in place? In general all transformations, including union, will create new
RDDs, not modify old RDDs in place.
Here's a quick test:
scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
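The message is cut off there; a plausible continuation of that quick test (reconstructed, with typical spark-shell output, not the original message text) would be:

scala> val secondRDD = sc.parallelize(6 to 10)
secondRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> val unioned = firstRDD.union(secondRDD)
unioned: org.apache.spark.rdd.RDD[Int] = UnionRDD[2] at union at <console>:16

scala> firstRDD.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5)

firstRDD still holds exactly its original elements; union() produced a brand-new UnionRDD rather than touching either input.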
No, RDDs are immutable. union() creates a new RDD, and does not modify
an existing RDD. Maybe this obviates the question. I'm not sure what
you mean about releasing from memory. If you want to repartition the
unioned RDD, you repartition the result of union(), not anything else.
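In code that's just something like this (sketch; rdd1 and rdd2 are placeholder names):

val unioned = rdd1.union(rdd2).repartition(8)  // repartition the new RDD that union() returns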
On Fri, Dec 5, 2014, ronalday...@live.com wrote:
I'm a bit confused regarding expected behavior of unions. I'm running on 8
cores. I have an RDD that is used to collect cluster associations (cluster id,
content id, distance) for internal clusters as well as leaf clusters since I'm
doing hierarchical k-means and need all distances for sorting d ...
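The pattern in question, as a rough sketch (reconstructed; collectAssociations, computeDistances, and all other names here are hypothetical, not the actual code):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Accumulate (clusterId, contentId, distance) tuples across clusters by repeated union.
def collectAssociations(
    sc: SparkContext,
    clusterIds: Seq[Int],
    computeDistances: Int => RDD[(Int, Long, Double)]  // hypothetical: distances for one cluster
): RDD[(Int, Long, Double)] = {
  var associations: RDD[(Int, Long, Double)] = sc.emptyRDD[(Int, Long, Double)]
  for (id <- clusterIds) {
    // union() returns a new RDD each time; neither input is modified in place
    associations = associations.union(computeDistances(id))
  }
  associations
}

Each pass through the loop rebinds the associations variable to a new RDD; none of the earlier RDDs is ever changed. A long chain of unions does grow the lineage, which is where the fault-tolerance and bookkeeping concerns above come from; SparkContext.union(Seq(...)) flattens many RDDs into a single UnionRDD in one step.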