RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
is done with Spark. But anyway, that is the very thing Spark is advertised for. > From: so...@cloudera.com > Date: Sat, 6 Dec 2014 06:39:10 -0600 > Subject: Re: Java RDD Union > To: ronalday...@live.com > CC: user@spark.apache.org > > I guess a major problem with this is that

Re: Java RDD Union

2014-12-06 Thread Sean Owen
I guess a major problem with this is that you lose fault tolerance. You have no way of recreating the local state of the mutable RDD if a partition is lost. Why would you need thousands of RDDs for kmeans? It's a few per iteration. An RDD is more bookkeeping than data structure, itself. They don'

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
? > From: so...@cloudera.com > Date: Fri, 5 Dec 2014 14:58:37 -0600 > Subject: Re: Java RDD Union > To: ronalday...@live.com; user@spark.apache.org > > foreach also creates a new RDD, and does not modify an existing RDD. > However, in practice, nothing stops you from fiddling

Re: Java RDD Union

2014-12-05 Thread Sean Owen
re in fact not > changing but there are referents are and somehow this will no longer work > when clustering. > > Thanks for comments. > >> From: so...@cloudera.com >> Date: Fri, 5 Dec 2014 14:22:38 -0600 >> Subject: Re: Java RDD Union >> To: ronalday...@live.co

Re: Java RDD Union

2014-12-05 Thread Sameer Farooqui
Hi Ron,
Out of curiosity, why do you think that union is modifying an existing RDD in place? In general all transformations, including union, will create new RDDs, not modify old RDDs in place. Here's a quick test:
scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD: org.apache.spark.rdd.RDD[I

Re: Java RDD Union

2014-12-05 Thread Sean Owen
No, RDDs are immutable. union() creates a new RDD, and does not modify an existing RDD. Maybe this obviates the question. I'm not sure what you mean about releasing from memory. If you want to repartition the unioned RDD, you repartition the result of union(), not anything else. On Fri, Dec 5, 201
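
The point above (union() returns a new RDD, and repartitioning applies to that result) can be sketched roughly as follows. This is only an illustration, not code from the thread: the object name UnionSketch, the variable names secondRDD, combined and repartitioned, the app name, and the local[8] master (chosen to mirror the 8 cores mentioned in the original question) are all assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone sketch; names and settings are illustrative assumptions.
object UnionSketch {
  def main(args: Array[String]): Unit = {
    // local[8] is an assumption that mirrors the 8 cores mentioned in the thread.
    val conf = new SparkConf().setAppName("union-sketch").setMaster("local[8]")
    val sc = new SparkContext(conf)
    try {
      val firstRDD = sc.parallelize(1 to 5)
      val secondRDD = sc.parallelize(6 to 10)

      // union() returns a new RDD; neither input is modified.
      val combined = firstRDD.union(secondRDD)
      assert(firstRDD.count() == 5)
      assert(secondRDD.count() == 5)
      assert(combined.count() == 10)

      // Repartition the result of union(), not the inputs.
      // repartition() also returns a new RDD and leaves `combined` as it was.
      val repartitioned = combined.repartition(8)
      assert(repartitioned.partitions.length == 8)
      assert(repartitioned.count() == 10)
    } finally {
      sc.stop()
    }
  }
}
```

Because every transformation hands back a new RDD, the usual pattern is to keep a reference to the returned value (here combined or repartitioned) and let the older references go out of scope when they are no longer needed.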

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding the expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters since I'm doing hierarchical k-means and need all distances for sorting d