Hi all,
We have come up with an initial distributed implementation of the Gaussian
Mixture Model in PySpark, where the parameters are estimated using the
Expectation-Maximization (EM) algorithm. Our current implementation assumes
a diagonal covariance matrix for each component.
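For reference, a minimal single-machine sketch of what the EM updates look like
for a mixture with diagonal covariances (plain NumPy, not the actual distributed
PySpark implementation; function name, initialization scheme, and the variance
floor of 1e-6 are illustrative assumptions):

```python
import numpy as np

def gmm_em_diag(X, k, n_iter=50, init_means=None, seed=0):
    """EM for a Gaussian mixture with diagonal covariances.

    Single-machine sketch: the distributed version would compute the same
    sufficient statistics (responsibility-weighted sums) per partition and
    aggregate them, but the updates below are the same mathematically.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize: means from random data points (or caller-supplied),
    # unit variances, uniform mixing weights.
    if init_means is None:
        means = X[rng.choice(n, k, replace=False)].astype(float)
    else:
        means = np.array(init_means, dtype=float)
    variances = np.ones((k, d))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log density of each point under each component.
        # Diagonal covariance => the density factorizes over dimensions.
        log_prob = (-0.5 * (((X[:, None, :] - means[None]) ** 2
                             / variances[None]).sum(-1)
                            + np.log(2.0 * np.pi * variances).sum(-1))
                    + np.log(weights))
        log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
        resp = np.exp(log_prob - log_norm)   # responsibilities, shape (n, k)
        # M-step: re-estimate weights, means, and per-dimension variances
        # from the responsibility-weighted sufficient statistics.
        nk = resp.sum(0)                     # effective counts per component
        weights = nk / n
        means = resp.T @ X / nk[:, None]
        variances = resp.T @ (X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```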
We did an initial benchmark study on a 2-node Spark standalone cluster,
where each node has 8 cores and 8 GB RAM; the Spark version used is
1.0.0. We also evaluated the Python version of k-means available in Spark
on the same datasets. Below are the results from this benchmark study.
The reported stats are averages over 10 runs. Tests were done on multiple
datasets with varying numbers of features and instances.
                          Gaussian mixture model          K-means (Python)
Instances    Dimensions   Avg time/iter  100 iterations   Avg time/iter  100 iterations
0.7 million  13           7 s            12 min           13 s           26 min
1.8 million  11           17 s           29 min           33 s           53 min
10 million   16           1.6 min        2.7 hr           1.2 min        2 hr
We are interested in contributing this implementation as a patch to
Spark. Does MLlib accept Python implementations? If not, can we
contribute to the PySpark component?
I have created a JIRA for the same:
https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the
ticket assigned to myself?
Please review and suggest how to take this forward.
--
Regards,
Meethu Mathew
Engineer
Flytxt
F: +91 471.2700202
www.flytxt.com | Visit our blog <http://blog.flytxt.com/> | Follow us
<http://www.twitter.com/flytxt> | Connect on Linkedin
<http://www.linkedin.com/home?trk=hb_tab_home_top>