Let's move the discussion to JIRA. Thanks!

On Fri, Oct 7, 2016 at 8:43 PM, 王磊(安全部) <wangleikidd...@didichuxing.com> wrote:
> https://issues.apache.org/jira/browse/SPARK-17825
>
> Actually, I had already created a JIRA. Could you let me know your progress, so we can avoid duplicated work?
>
> Thanks!
>
> From: didi <wangleikidd...@didichuxing.com>
> Date: Saturday, October 8, 2016, 12:21 AM
> To: Yanbo Liang <yblia...@gmail.com>
> Cc: "d...@spark.apache.org" <d...@spark.apache.org>, "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> Thanks for replying.
> When could you send out the PR?
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Friday, October 7, 2016, 11:35 PM
> To: didi <wangleikidd...@didichuxing.com>
> Cc: "d...@spark.apache.org" <d...@spark.apache.org>, "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> It's a good question, and I had a similar requirement in my own work. I'm currently copying the implementation from mllib to ml, and will then expose the maximum log likelihood. I will send this PR soon.
>
> Thanks.
> Yanbo
>
> On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wangleikidd...@didichuxing.com> wrote:
>
>> Hi,
>>
>> Do you guys sometimes need to get the log likelihood of the EM algorithm in MLLIB?
>>
>> I mean the value in this line: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228
>>
>> Copying the code here:
>>
>> val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
>>
>> // Create new distributions based on the partial assignments
>> // (often referred to as the "M" step in literature)
>> val sumWeights = sums.weights.sum
>>
>> if (shouldDistributeGaussians) {
>>   val numPartitions = math.min(k, 1024)
>>   val tuples = Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
>>   val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
>>     updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
>>   }.collect().unzip
>>   Array.copy(ws.toArray, 0, weights, 0, ws.length)
>>   Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
>> } else {
>>   var i = 0
>>   while (i < k) {
>>     val (weight, gaussian) = updateWeightsAndGaussians(
>>       sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
>>     weights(i) = weight
>>     gaussians(i) = gaussian
>>     i = i + 1
>>   }
>> }
>>
>> llhp = llh                // current becomes previous
>> llh = sums.logLikelihood  // this is the freshly computed log-likelihood
>> iter += 1
>> compute.destroy(blocking = false)
>>
>> In my application, I need the log likelihood to compare the fit for different numbers of clusters, and then use the cluster count with the maximum log likelihood.
>>
>> Is it a good idea to expose this value?
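
For readers following the thread, here is a minimal sketch of the use case described above: fitting mllib's GaussianMixture for several candidate cluster counts and keeping the k with the highest log likelihood. The `logLikelihood` accessor on the fitted model is hypothetical; it is exactly the value computed at GaussianMixture.scala#L228 that this thread proposes to expose, and it does not exist on GaussianMixtureModel at the time of this discussion. The SparkContext setup and toy data are illustrative only.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.GaussianMixture
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(
      new SparkConf().setAppName("gmm-model-selection").setMaster("local[*]"))

    // Toy two-cluster data; in practice this would be your real feature vectors.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, -0.1), Vectors.dense(-0.1, 0.0),
      Vectors.dense(9.8, 10.1), Vectors.dense(10.2, 9.9), Vectors.dense(10.0, 10.0)
    ))

    // Fit one model per candidate k and keep the k with the highest log likelihood.
    val candidates = Seq(2, 3, 4).map { k =>
      val model = new GaussianMixture()
        .setK(k)
        .setConvergenceTol(0.01)
        .setMaxIterations(100)
        .run(data)
      // `logLikelihood` is the hypothetical accessor this thread asks for;
      // today the value is computed inside run() but never surfaced.
      (k, model.logLikelihood)
    }

    val (bestK, bestLlh) = candidates.maxBy(_._2)
    println(s"best k = $bestK with log likelihood = $bestLlh")

One caveat on the selection rule itself: the maximized log likelihood is non-decreasing in k, so comparing raw values tends to favor larger k; in practice it is common to penalize model complexity (e.g. AIC/BIC) rather than take the maximum directly.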