Yes, the count() should be the first job, and the sampling + collecting
should be the second job. The first one is probably slow because the RDD
being sampled is not yet cached/materialized.
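The two-pass structure can be sketched in plain Python. This is only an
illustrative model of how takeSample (without replacement) behaves, not
Spark's actual code: the function name, the oversampling constant, and the
Bernoulli-style per-element draw are assumptions for the sketch; Spark
computes a similarly inflated fraction and samples per partition via a
PartitionwiseSampledRDD.

```python
import random

def take_sample(data, num, seed=None):
    """Illustrative sketch of RDD.takeSample's two passes:
    pass 1 counts the data; pass 2 samples with an inflated
    fraction and retries if it came up short."""
    rng = random.Random(seed)
    # Pass 1: the count() action. In Spark this is the first job, and it
    # is often the slow one because it may materialize the RDD.
    total = len(data)
    if num >= total:
        return list(data)
    # Oversample so a single pass usually yields at least `num` items.
    # (The 1.5 factor and additive term are illustrative, not Spark's
    # exact formula.)
    fraction = min(1.0, (num / total) * 1.5 + 5.0 / total)
    # Pass 2: per-element sampling plus collect, retrying on a shortfall.
    while True:
        sampled = [x for x in data if rng.random() < fraction]
        if len(sampled) >= num:
            rng.shuffle(sampled)
            return sampled[:num]
```

Because the count has already materialized (and, in k-means, cached) the
data, the second pass is typically much cheaper, which matches the timing
difference described below.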

K-Means creates some RDDs internally while learning, and since they aren't
needed after learning, they are unpersisted (uncached) at the end.

Joseph

On Sat, Apr 25, 2015 at 6:36 AM, podioss <grega...@hotmail.com> wrote:

> Hi,
> I am running the k-means algorithm with initialization mode set to random
> and various dataset sizes and values for the number of clusters, and I
> have a question regarding the takeSample job of the algorithm.
> More specifically, I notice that in every application there are two
> sampling jobs. The first one consumes the most time compared to all the
> others, while the second one is much quicker, and that sparked my
> interest to investigate what is actually happening.
> To explain it, I checked the source code of the takeSample operation and
> saw that there is a count action involved, followed by the computation of
> a PartitionwiseSampledRDD with a PoissonSampler.
> So my question is: does that count action correspond to the first
> takeSample job, and is the second takeSample job the one doing the actual
> sampling?
>
> I also have a question about the RDDs that are created for k-means. In
> the middle of the execution, under the storage tab of the web UI, I can
> see 3 RDDs with their partitions cached in memory across all nodes, which
> is very helpful for monitoring reasons. The problem is that after
> completion I can only see one of them and the portion of the cache memory
> it used, and I would like to ask why the web UI doesn't display all the
> RDDs involved in the computation.
>
> Thank you
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>