The definition of org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11, long k12, long k21, long k22):
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
    // note that we have counts here, not probabilities, and that the entropy is not normalized.
    double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
    double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy > matrixEntropy) {
      // round off error
      return 0.0;
    }
    return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
  }

The rowEntropy and columnEntropy computed here might be wrong. I think they should be:

  double rowEntropy = entropy(k11 + k12, k21 + k22);
  double columnEntropy = entropy(k11 + k21, k12 + k22);

which matches

  LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

as given at http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .

That is, LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in this example) and I is the mutual information:

  I = sum over x, y of p(x,y) * log( p(x,y) / (p(x) * p(y)) )

where x is the value of event A (1 or 2), y is the value of event B (1 or 2), p(x,y) = kxy / N, and p(x) = p(x,1) + p(x,2); e.g. p(1,1) = k11 / N. Expanding the logarithm and collecting terms:

  I = sum_{x,y} p(x,y) * log p(x,y)
      - sum_x p(x) * log p(x)
      - sum_y p(y) * log p(y)

Writing H(.) for sums of this p * log p form, we get

  mutual_information = H(k) - H(rowSums(k)) - H(colSums(k))

The Mahout version of unnormalized entropy gives entropy(k11, k12, k21, k22) = N * H(k), so:

  entropy(k11, k12, k21, k22) - entropy(k11 + k12, k21 + k22) - entropy(k11 + k21, k12 + k22)
    = N * (H(k) - H(rowSums(k)) - H(colSums(k)))

which, multiplied by 2.0, is exactly the LLR.

Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong, or have I misunderstood something?
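
For reference, here is a minimal standalone sketch I used to compare the two versions numerically. The xLogX and entropy helpers below are my own reimplementation of what I understand Mahout's private helpers in LogLikelihood to be, the class name is just for this example, and the counts in main are arbitrary:

  public class LlrCheck {

    // x * ln(x), with the usual convention that 0 * ln(0) == 0
    static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy over raw counts:
    // N * ln(N) - sum of x_i * ln(x_i), where N = sum of x_i
    static double entropy(long... xs) {
      long sum = 0;
      double sumXLogX = 0.0;
      for (long x : xs) {
        sum += x;
        sumXLogX += xLogX(x);
      }
      return xLogX(sum) - sumXLogX;
    }

    public static void main(String[] args) {
      long k11 = 100, k12 = 10, k21 = 20, k22 = 10000; // arbitrary example counts

      double matrixEntropy = entropy(k11, k12, k21, k22);

      // Current Mahout version: sum of per-row / per-column entropies.
      double rowEntropyCurrent = entropy(k11, k12) + entropy(k21, k22);
      double columnEntropyCurrent = entropy(k11, k21) + entropy(k12, k22);

      // Proposed version: entropies of the marginal row / column sums.
      double rowEntropyProposed = entropy(k11 + k12, k21 + k22);
      double columnEntropyProposed = entropy(k11 + k21, k12 + k22);

      System.out.println("current:  "
          + 2.0 * (matrixEntropy - rowEntropyCurrent - columnEntropyCurrent));
      System.out.println("proposed: "
          + 2.0 * (matrixEntropy - rowEntropyProposed - columnEntropyProposed));
    }
  }

Running this on concrete counts should make it easy to see how the two variants relate.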