Hi salexln,

An RDD's immutability depends on its underlying data structure. Consider the following example:
------------------------------------------------------------------------------------------------------------------
scala> val m = Array.fill(2, 2)(0)
m: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))

scala> val rdd = sc.parallelize(m)
rdd: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[1] at parallelize at <console>:23

scala> rdd.collect()
res6: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))

scala> m(0)(1) = 2

scala> rdd.collect()
res8: Array[Array[Int]] = Array(Array(0, 2), Array(0, 0))
------------------------------------------------------------------------------------------------------------------

As you can see, the contents of rdd change when its underlying array changes: the RDD itself is immutable, but it holds references to mutable Array objects, so mutating those arrays through the original variable is visible when the RDD is collected.

Hope this helps.

Best,
Ai

On Mon, Dec 28, 2015 at 12:36 PM, salexln <sale...@gmail.com> wrote:
> Hi guys,
> I know that RDDs are immutable and therefore their values cannot be changed,
> but I see the following behaviour:
> I wrote an implementation of the FuzzyCMeans algorithm and now I'm testing it,
> so I run the following example:
>
> import org.apache.spark.mllib.clustering.FuzzyCMeans
> import org.apache.spark.mllib.linalg.Vectors
>
> val data =
>     sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt")
> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
>> parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
>> = MapPartitionsRDD[2] at map at <console>:31
>
> val numClusters = 2
> val numIterations = 20
>
> parsedData.foreach{ point => println(point) }
>> [0.0,-8.0]
> [-3.0,-2.0]
> [-3.0,0.0]
> [-3.0,2.0]
> [-2.0,-1.0]
> [-2.0,0.0]
> [-2.0,1.0]
> [-1.0,0.0]
> [0.0,0.0]
> [1.0,0.0]
> [2.0,-1.0]
> [2.0,0.0]
> [2.0,1.0]
> [3.0,-2.0]
> [3.0,0.0]
> [3.0,2.0]
> [0.0,8.0]
>
> val clusters = FuzzyCMeans.train(parsedData, numClusters, numIterations)
> parsedData.foreach{ point => println(point) }
>> [0.0,-0.4803333185624595]
> [-0.1811743096972924,-0.12078287313152826]
> [-0.06638890786148487,0.0]
> [-0.04005925925925929,0.02670617283950619]
> [-0.12193263222069807,-0.060966316110349035]
> [-0.0512,0.0]
> [NaN,NaN]
> [-0.049382716049382706,0.0]
> [NaN,NaN]
> [0.006830134553650707,0.0]
> [0.05120000000000002,-0.02560000000000001]
> [0.04755220304297078,0.0]
> [0.06581619798335057,0.03290809899167529]
> [0.12010867103812725,-0.0800724473587515]
> [0.10946638900458144,0.0]
> [0.14814814814814817,0.09876543209876545]
> [0.0,0.49119985188436205]
>
> But how can it be that my method changes the immutable RDD?
>
> BTW, the signature of the train method is the following:
>
> train(data: RDD[Vector], clusters: Int, maxIterations: Int)
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Vector-Immutability-issue-tp15827.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
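P.S. If the goal is to freeze the values at the moment the RDD is created, one way to avoid this aliasing is to defensively copy the mutable rows before calling parallelize. A minimal sketch, assuming the same spark-shell session (where `sc` is already available):

```scala
// Defensive copy: clone each mutable row before handing it to Spark, so
// later writes to `m` are not visible through the RDD. (The aliasing shown
// in the transcript above is observable in local mode, where the driver's
// object references are reused rather than serialized to remote executors.)
val m = Array.fill(2, 2)(0)
val rdd = sc.parallelize(m.map(_.clone())) // each Array[Int] row is copied
m(0)(1) = 2
rdd.collect() // the cloned rows still contain only zeros
```

The same reasoning applies to the FuzzyCMeans question quoted above: an RDD[Vector] is only as immutable as the Vector objects it references, so if the train method writes into those vectors in place, the changes will show through parsedData.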