The function predict.lda() is answering a different question from the one you 
are posing. It answers the question: given the values on this object, what 
is the probability of membership in each of the groups used to construct the 
discriminant functions in the first place? Those probabilities sum to 1 and are 
generally called posterior probabilities. Because they are conditional on the 
object belonging to one of those groups, they sum to 1 no matter how far the 
object lies from every group centroid. Your question is somewhat different: if 
this object were a member of group x, what is the probability that it would 
have values like these? These are typicality probabilities (how typical is 
this observation of this group). 
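
To illustrate the point about posterior probabilities, here is a minimal 
sketch using the built-in iris data rather than your data. The observation 
`far` is a made-up point chosen to lie far from all three group centroids; 
it is an assumption for illustration only.

```r
library(MASS)  # provides lda() and its predict() method

# Fit discriminant functions on the built-in iris data (three groups).
fit <- lda(Species ~ ., data = iris)

# A hypothetical observation far from every group centroid.
far <- data.frame(Sepal.Length = 15, Sepal.Width = 10,
                  Petal.Length = 0.5, Petal.Width = 5)

# Posterior probabilities of membership in each of the three groups.
post <- predict(fit, newdata = far)$posterior
print(round(post, 4))

# The posteriors still sum to 1 (up to floating-point error),
# even though the point resembles none of the groups.
print(sum(post))
```

The posterior only compares the groups with one another, so it cannot flag 
an observation that is atypical of all of them.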

There are two ways to compute typicality probabilities. One is to use the 
reduced space defined by the discriminant functions and measure the distance 
from a new observation to the centroid of the group. This is the approach 
taken by SPSS, which provides the typicality for the group that has the 
highest posterior probability. Huberty and Olejnik (2006) recommend this 
procedure on the grounds that the probability distribution is known. The 
alternative approach, which is commonly used in compositional analysis, is to 
use the Mahalanobis distance, with the squared distance assumed to follow a 
chi-square distribution. I am not aware of a package that has a function to 
produce either of these.
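
The Mahalanobis approach can be sketched roughly as follows. This is not a 
packaged function: typicality() is a hypothetical helper written for this 
example, the iris data stand in for your training set, and I am assuming the 
distance is computed in the original variable space using the pooled 
within-group covariance matrix (divisor n - g), with degrees of freedom equal 
to the number of variables. The chi-square reference treats the group means 
and covariance as known, so treat the values as approximate.

```r
library(MASS)

X   <- as.matrix(iris[, 1:4])   # training measurements
grp <- iris$Species             # known group labels
fit <- lda(X, grouping = grp)

# Pooled within-group covariance matrix (shared across groups in LDA).
n <- nrow(X)
g <- nlevels(grp)
resid  <- X - fit$means[as.character(grp), ]
pooled <- crossprod(resid) / (n - g)

# Hypothetical helper: typicality of each row of newx for each group.
typicality <- function(newx, means, S) {
  newx <- as.matrix(newx)
  # Squared Mahalanobis distance from every row to every group centroid.
  d2 <- sapply(rownames(means),
               function(k) mahalanobis(newx, means[k, ], S))
  d2 <- matrix(d2, nrow = nrow(newx),
               dimnames = list(rownames(newx), rownames(means)))
  # Upper-tail chi-square probability: small values mean "atypical".
  pchisq(d2, df = ncol(newx), lower.tail = FALSE)
}

# One training observation and one made-up point far from all groups.
far <- c(Sepal.Length = 15, Sepal.Width = 10,
         Petal.Length = 0.5, Petal.Width = 5)
new <- rbind(typical = X[1, ], far = far)

typ <- typicality(new, fit$means, pooled)
print(signif(typ, 3))
```

Unlike the posterior probabilities, the rows of `typ` are not forced to sum 
to 1, so an observation can have a typicality near zero for every group.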

Huberty, Carl J. and Stephen Olejnik. 2006. Applied MANOVA and Discriminant 
Analysis. Second Edition. Wiley-Interscience.

David L. Carlson
Department of Anthropology
Texas A&M University


-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Fraser D. Neiman
Sent: Friday, August 29, 2014 4:14 PM
To: r-help@r-project.org
Subject: [R] posterior probabilities from lda.predict

Dear All,

I have used the lda() function in the MASS library to estimate a set of 
discriminant functions to assign samples from a training set to one of six 
groups.  The cross validation generates nearly perfect predictions for samples 
in the training set.  Hooray!

Now I want to use lda.predict() to estimate both discriminant function scores 
and probabilities of group membership for a second set of samples whose group 
membership is unknown.  For each unknown sample, lda.predict() produces six 
probabilities. These probabilities sum to one. So lda.predict() seems to assume 
that the unknown samples do, in fact, belong to one of the six groups.  

The problem is that it is nearly certain that some of the unknown samples in 
the second set do not belong to any of the six groups. For those samples, 
probabilities of group membership should be close to zero for all six groups.  
In fact, identifying which samples are unlikely to belong to any of the six 
groups is a major goal of the analysis. 

So the question is, what is lda.predict() doing behind the scenes to force the 
group membership probabilities to sum to one? How do I get it to not do this 
and produce probabilities that accurately reflect the large Mahalanobis 
distances of some of the unknown samples from any group centroid?

I have searched the R-list archive on this and have found several folks asking 
similar questions, but no helpful answers.

Thanks very much!

Fraser
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
