Hmm, here I am running Spark in local mode on my laptop with 8 cores. The data
is on my local filesystem. Even though there is some overhead due to the
distributed computation, I find the difference between the runtimes of the two
implementations really huge. Is there a benchmark on how well the k-means
algorithm implemented in MLlib performs?
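For concreteness, my run looks roughly like the sketch below (the file path, k,
and the iteration/run counts here are placeholders, not my actual values):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansLocal {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("kmeans-local").setMaster("local[8]"))

    // Assumed format: one 384-dimensional vector per line, space-separated.
    val data = sc.textFile("/path/to/vectors.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache() // without cache(), every iteration re-reads and re-parses the file

    val model = KMeans.train(
      data,
      100,                     // k (placeholder)
      20,                      // maxIterations (placeholder)
      1,                       // runs; values > 1 multiply the cost
      KMeans.K_MEANS_PARALLEL) // k-means|| initialization

    println(s"Cost: ${model.computeCost(data)}")
    sc.stop()
  }
}

As far as I understand, leaving out cache() or using several runs would already
explain a large slowdown, so I want to rule those out before comparing against
Julia.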

On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:

> Spark has much more overhead, since it's set up to distribute the
> computation. Julia isn't distributed, and so has no such overhead in a
> completely in-core implementation. You generally use Spark when you
> have a problem large enough to warrant distributing, or, your data
> already lives in a distributed store like HDFS.
>
> But it's also possible you're not configuring the implementations the
> same way, yes. There's not enough info here really to say.
>
> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> wrote:
> > Hi all,
> >
> > I'm trying to run a clustering with the k-means algorithm. My data set is
> > about 240k vectors of dimension 384.
> >
> > Solving the problem with the k-means available in Julia (kmeans++)
> >
> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
> >
> > takes about 8 minutes on a single core.
> >
> > Solving the same problem with Spark's kmeans|| takes more than 1.5 hours
> > with 8 cores!!!!
> >
> > Either they don't implement the same algorithm, or I don't understand how
> > the k-means in Spark works. Is my data not big enough to take full
> > advantage of Spark? At the very least, I expected a comparable runtime.
> >
> >
> > Cheers,
> >
> >
> > Jao
>