Yes, the count() should be the first job, and the sampling + collecting should be the second job. The first one is probably slow because the RDD being sampled is not yet cached/materialized.
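To make the two-job structure concrete, here is a simplified, Spark-free sketch of the logic behind RDD.takeSample: first a full count of the data (the slow first job when nothing is cached), then a second pass that samples with an over-allocated fraction and trims the result to the requested size. The oversampling factor and retry loop are illustrative stand-ins for Spark's internal sample-size computation, not its exact code.

```python
import random

def take_sample(data, num, seed=42):
    """Driver-side sketch of the two-pass takeSample logic.

    Pass 1: count the data (in Spark this is rdd.count(), the first job,
            which materializes the RDD if it is not cached).
    Pass 2: sample each element independently with an oversampled
            fraction, then trim the result to exactly `num` elements.
    """
    rng = random.Random(seed)
    n = len(data)                      # stands in for rdd.count() (job 1)
    if num >= n:
        shuffled = list(data)
        rng.shuffle(shuffled)
        return shuffled
    # Oversample so the per-element draw rarely comes up short
    # (illustrative factor; Spark computes a tighter bound internally).
    fraction = min(1.0, 2.0 * num / n)
    sampled = [x for x in data if rng.random() < fraction]   # job 2
    while len(sampled) < num:          # rare retry if the draw fell short
        sampled = [x for x in data if rng.random() < fraction]
    rng.shuffle(sampled)
    return sampled[:num]
```

This also shows why caching the input before calling takeSample helps: the count pass and the sampling pass each scan the full dataset, so an uncached RDD is recomputed twice.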
K-Means creates some RDDs internally while learning, and since they aren't needed after learning, they are unpersisted (uncached) at the end.

Joseph

On Sat, Apr 25, 2015 at 6:36 AM, podioss <grega...@hotmail.com> wrote:

> Hi,
> I am running the k-means algorithm with initialization mode set to random
> and various dataset sizes and values for clusters, and I have a question
> regarding the takeSample job of the algorithm.
> More specifically, I notice that in every application there are two
> sampling jobs. The first one consumes the most time compared to all
> others, while the second one is much quicker, and that sparked my
> interest to investigate what is actually happening.
> To explain it, I checked the source code of the takeSample operation and
> saw that there is a count action involved, followed by the computation of
> a PartitionwiseSampledRDD with a PoissonSampler.
> So my question is: does that count action correspond to the first
> takeSample job, and is the second takeSample job the one doing the actual
> sampling?
>
> I also have a question about the RDDs that are created for k-means. In
> the middle of the execution, under the Storage tab of the web UI, I can
> see 3 RDDs with their partitions cached in memory across all nodes, which
> is very helpful for monitoring reasons. The problem is that after
> completion I can only see one of them and the portion of cache memory it
> used, and I would like to ask why the web UI doesn't display all the RDDs
> involved in the computation.
>
> Thank you
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.