Spark has much more overhead, since it's set up to distribute the computation. Julia isn't distributed, and so has no such overhead in a completely in-core implementation. You generally use Spark when you have a problem large enough to warrant distributing, or when your data already lives in a distributed store like HDFS.
But it's also possible you're not configuring the two implementations the same way, yes. There's not really enough info here to say; a minimal sketch of the MLlib setup I'd sanity-check is below the quoted message.

On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:

> Hi all,
>
> I'm trying to run clustering with the k-means algorithm. My data set is
> about 240k vectors of dimension 384.
>
> Solving the problem with the kmeans available in Julia (kmeans++),
>
> http://clusteringjl.readthedocs.org/en/latest/kmeans.html
>
> takes about 8 minutes on a single core.
>
> Solving the same problem with Spark's kmeans|| takes more than 1.5 hours
> with 8 cores!!!!
>
> Either they don't implement the same algorithm, or I don't understand how
> the kmeans in Spark works. Is my data not big enough to take full
> advantage of Spark? At the least, I expected the same runtime.
>
> Cheers,
>
> Jao
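For reference, here's roughly what a comparable MLlib run looks like in Scala, under some assumptions I'm making since the thread doesn't say: the input is a whitespace-delimited text file (the path "features.txt" and the k value are placeholders; use whatever k the Julia run used), and this is the Spark 1.x MLlib API. The cache() call and a runs value of 1 are the two settings most often behind this kind of gap:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-check"))

    // Hypothetical input: one 384-dim vector per line, space-separated.
    val vectors = sc.textFile("features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache() // without this, every k-means pass re-reads and re-parses the file

    vectors.count() // materialize the cache so load time isn't billed to k-means

    val model = KMeans.train(
      vectors,
      100, // k: placeholder -- use the same number of clusters as the Julia run
      20,  // maxIterations
      1,   // runs: MLlib can do several runs at once; Julia does one
      KMeans.K_MEANS_PARALLEL) // the k-means|| initialization being compared

    println("Cost: " + model.computeCost(vectors))
    sc.stop()
  }
}

k-means makes a full pass over the data per iteration, so an uncached input RDD alone can turn minutes into hours. Also note that k-means|| initialization itself takes several passes over the data, which is extra work a single-machine kmeans++ doesn't do.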