Thanks for replying. When could you send out the PR?

From: Yanbo Liang <yblia...@gmail.com>
Date: Friday, October 7, 2016, 11:35 PM
To: didi <wangleikidd...@didichuxing.com>
Cc: "d...@spark.apache.org" <d...@spark.apache.org>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
It's a good question, and I've had a similar requirement in my own work. I'm currently porting the implementation from mllib to ml, and will then expose the maximum log likelihood. I will send this PR soon.

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊 (Security Dept.) <wangleikidd...@didichuxing.com> wrote:

Hi,

Do you sometimes need to get the log likelihood of the EM algorithm in MLLIB? I mean the value computed at this line:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228

Copying the code here:

    val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

    // Create new distributions based on the partial assignments
    // (often referred to as the "M" step in literature)
    val sumWeights = sums.weights.sum

    if (shouldDistributeGaussians) {
      val numPartitions = math.min(k, 1024)
      val tuples = Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
      val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
        updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
      }.collect().unzip
      Array.copy(ws.toArray, 0, weights, 0, ws.length)
      Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
    } else {
      var i = 0
      while (i < k) {
        val (weight, gaussian) = updateWeightsAndGaussians(
          sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
        weights(i) = weight
        gaussians(i) = gaussian
        i = i + 1
      }
    }

    llhp = llh                // current becomes previous
    llh = sums.logLikelihood  // this is the freshly computed log-likelihood
    iter += 1
    compute.destroy(blocking = false)

In my application, I need the log likelihood to compare the fit for different numbers of clusters, and I then use the cluster count with the maximum log likelihood. Would it be a good idea to expose this value?
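Until the value is exposed, one possible workaround is to recompute the total log likelihood from the fitted GaussianMixtureModel's public weights and gaussians, and sweep over candidate cluster counts. This is only a rough sketch, not Spark's internal computation; logLikelihood and bestK are hypothetical helper names:

    import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Recompute the total log likelihood of `data` under a fitted model,
    // using only the model's public weights and gaussians.
    def logLikelihood(model: GaussianMixtureModel, data: RDD[Vector]): Double = {
      val pairs = model.weights.zip(model.gaussians) // capture locally for the closure
      data.map { x =>
        math.log(pairs.map { case (w, g) => w * g.pdf(x) }.sum)
      }.sum()
    }

    // Fit one model per candidate k and keep the k with the best log likelihood.
    def bestK(data: RDD[Vector], ks: Seq[Int]): (Int, Double) =
      ks.map { k =>
        val model = new GaussianMixture().setK(k).run(data)
        (k, logLikelihood(model, data))
      }.maxBy(_._2)

Note that this costs an extra pass over the data for each fitted model, whereas the value at the line linked above already falls out of the E step for free, which is an argument for exposing it directly.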