Hi,

I'm currently using the R package e1071 to train naive Bayes classifiers and came across a bug: when the (unnormalized) posterior probabilities of all classes are small enough to underflow to zero, predict.naiveBayes returns NaNs. The problem lies in how the log-transformed probabilities are handled inside predict.naiveBayes. Here is an example that demonstrates the problem (you might need to increase 'nvar' depending on your machine):
-------------------- 8< --------------------
library(e1071)

N <- 100
nvar <- 60
varnames <- paste("v", 1:nvar, sep="")

# Two well-separated classes: mean 0 and mean 10 on every variable
dat <- sapply(1:nvar, function(dummy) {c(rnorm(N/2, 0, 1), rnorm(N/2, 10, 1))})
colnames(dat) <- varnames
out <- rep(c("a","b"), each=N/2)

nb <- naiveBayes(x=dat, y=out)

# A new observation halfway between the classes, very unlikely under both
new.dat <- t(rnorm(nvar, 5, 0.1))
colnames(new.dat) <- varnames
predict(nb, new.dat, type="raw")
-------------------- 8< --------------------

The result of the last line is usually NaN for both classes.

As for the solution: to protect against very small numbers, e1071:::predict.naiveBayes takes the probabilities into log-space and adds log-probabilities instead of multiplying probabilities. However, when calculating the posterior probabilities of each class (when type = "raw"), the log-probabilities are exponentiated before they are normalized, which defeats the purpose of the log-space transformation.

I suggest the following change to the code. Towards the end of the predict.naiveBayes function, you currently do:

    L <- exp(L)
    L / sum(L)   # this is what is returned

You can instead use:

    sapply(L, function(lp) {1 / sum(exp(L - lp))})

This comes from the following equality:

    x / (x + y + z) = 1 / (1 + exp(log(y) - log(x)) + exp(log(z) - log(x)))

Best wishes,
/Ali Tofigh
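
P.S. The difference between the two normalizations can also be checked without e1071. The following is a minimal sketch, assuming made-up log-scores for three classes; the vector L and its values are purely illustrative and are not taken from the package internals:

```r
# Hypothetical per-class log-scores (log prior + summed log-likelihoods).
# The values are chosen so that exp() underflows to 0 in double precision.
L <- c(a = -1000, b = -1001, c = -1002)

# Current approach: exponentiate first, then normalize.
# exp(-1000) is 0, so every entry is 0 / 0, i.e. NaN.
naive <- exp(L) / sum(exp(L))

# Proposed approach: normalize in log-space, one class at a time.
# Only differences of log-scores are exponentiated, and the term for the
# class itself is always exp(0) = 1, so the denominator is never 0.
stable <- sapply(L, function(lp) 1 / sum(exp(L - lp)))

print(naive)
print(stable)
```

If one class is far more likely than another, exp(L - lp) can overflow to Inf for the unlikely class, but 1 / Inf is simply 0 in R, so that class gets probability 0 rather than NaN.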