Hi all,
We have come up with an initial distributed implementation of the Gaussian
Mixture Model in PySpark, where the parameters are estimated using the
Expectation-Maximization (EM) algorithm. Our current implementation assumes
a diagonal covariance matrix for each component.
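For reference, a minimal single-machine sketch of what the EM updates look like
for a mixture with diagonal covariances (plain NumPy, not the actual distributed
PySpark implementation; function name, initialization scheme, and the variance
floor of 1e-6 are illustrative assumptions):

```python
import numpy as np

def gmm_em_diag(X, k, n_iter=50, init_means=None, seed=0):
    """EM for a Gaussian mixture with diagonal covariances.

    Single-machine sketch: the distributed version would compute the same
    sufficient statistics (responsibility-weighted sums) per partition and
    aggregate them, but the updates below are the same mathematically.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize: means from random data points (or caller-supplied),
    # unit variances, uniform mixing weights.
    if init_means is None:
        means = X[rng.choice(n, k, replace=False)].astype(float)
    else:
        means = np.array(init_means, dtype=float)
    variances = np.ones((k, d))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log density of each point under each component.
        # Diagonal covariance => the density factorizes over dimensions.
        log_prob = (-0.5 * (((X[:, None, :] - means[None]) ** 2
                             / variances[None]).sum(-1)
                            + np.log(2.0 * np.pi * variances).sum(-1))
                    + np.log(weights))
        log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
        resp = np.exp(log_prob - log_norm)   # responsibilities, shape (n, k)
        # M-step: re-estimate weights, means, and per-dimension variances
        # from the responsibility-weighted sufficient statistics.
        nk = resp.sum(0)                     # effective counts per component
        weights = nk / n
        means = resp.T @ X / nk[:, None]
        variances = resp.T @ (X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```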
We did an initial benchmark study on a 2-node Spark standalone cluster,
where each node has 8 cores and 8 GB RAM; the Spark version used is
1.0.0. We also evaluated the Python version of k-means available in Spark
on the same datasets. Below are the results from this benchmark study.
The reported stats are averages over 10 runs. Tests were done on multiple
datasets with varying numbers of features and instances.
                          Gaussian mixture model          K-means (Python)
Instances    Dimensions   Avg time/iter  100 iterations   Avg time/iter  100 iterations
0.7 million  13           7 s            12 min           13 s           26 min
1.8 million  11           17 s           29 min           33 s           53 min
10 million   16           1.6 min        2.7 hr           1.2 min        2 hr
We are interested in contributing this implementation as a patch to
Spark. Does MLlib accept Python implementations? If not, can we
contribute to the PySpark component?
I have created a JIRA for the same:
https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the
ticket assigned to myself?
Please review and suggest how to take this forward.
--
Regards,
Meethu Mathew
Engineer
Flytxt
F: +91 471.2700202
www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us
<http://www.twitter.com/flytxt> | Connect on Linkedin
<http://www.linkedin.com/home?trk=hb_tab_home_top>