On 3/08/2009, at 11:32 AM, Noah Silverman wrote:

> Rolf,
>
> Point taken.
>
> However, some of the variables in the experiment simply don't have data for some of the examples.
>
> Since I'm training an SVM that will complain about an NA, how do you suggest I handle this?
>
>
> Imagine a model predicting student performance/grades/whatever.
>
> One variable might be "past_gpa".
>
> If we have some students with no history, what do you put for that column? NA is more "correct", but won't work with an SVM.
>
> I'm always happy to learn...

I know next to nothing about support vector machines. Despite my ignorance I remain suspicious of the concept. I suspect that fortune("machine learning") is relevant.

If you have a data set that contains intrinsic NAs and you wish to apply SVM methods to these data, then you will need to understand how SVMs work and decide what *should* be done to handle these NAs. My vague understanding is that an SVM tries to build pairs of hyperplanes, as widely separated as possible, between classes of data. This requires that each datum be representable as a point in n-dimensional space. A datum one of whose entries is NA is not (really) such a point. Moreover it sure as hell isn't the same as the point produced by replacing that NA by 0.

To take your example involving past_gpa --- a student who has no past gpa is very likely to be very different from a student who has previously studied and failed everything!
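
To make that concrete, here is a minimal sketch in Python (not R, and not anything from an SVM package). The two student records and the second column, hours_studied, are invented for illustration; None stands in for NA.

```python
import math

# Hypothetical records: (past_gpa, hours_studied). None plays the role of NA.
no_history = (None, 12.0)          # student with no past GPA at all
failed_everything = (0.0, 12.0)    # student who studied before and failed

def zero_fill(point):
    """Replace every missing coordinate by 0, the substitution under discussion."""
    return tuple(0.0 if x is None else x for x in point)

def euclidean(p, q):
    """Ordinary Euclidean distance between two fully observed points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

filled = zero_fill(no_history)
distance = euclidean(filled, failed_everything)

# After zero-filling, the two students are literally the same point.
print(filled == failed_everything, distance)
```

Any method that sees only the filled-in coordinates cannot tell these two students apart.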

What you need is a *metric* which tells you the distance between a point with an NA in it and another point. The other point may have no NAs amongst its coordinates, or it might have an NA in a *different* coordinate. I.e. you need to define a distance between points, some of whose coordinates may be missing, in a *meaningful* way.
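
One possible shape for such a distance, sketched in Python under loud assumptions: compare only the coordinates observed in both points and rescale by the fraction of coordinates used. The name partial_distance and the rescaling rule are choices made here for illustration; nothing says this is the *meaningful* choice for any given data set, and it need not even satisfy the triangle inequality.

```python
import math

def partial_distance(p, q):
    """Distance over coordinates observed in BOTH points (None = missing).

    The squared differences are scaled up by len(p) / (number of shared
    coordinates), so points compared on few coordinates are not
    automatically "closer". Returns None when no coordinate is shared,
    i.e. when this rule gives no answer at all.
    """
    shared = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not shared:
        return None
    total = sum((a - b) ** 2 for a, b in shared)
    return math.sqrt(total * len(p) / len(shared))

a = (3.0, None, 1.0)    # missing second coordinate
b = (0.0, 2.0, 5.0)     # fully observed
c = (None, 4.0, None)   # missing in *different* coordinates than a

d_ab = partial_distance(a, b)   # uses coordinates 1 and 3
d_bc = partial_distance(b, c)   # uses coordinate 2 only
d_ac = partial_distance(a, c)   # nothing in common, so undefined
print(d_ab, d_bc, d_ac)
```

Note the last case: two points with no jointly observed coordinate have no distance under this rule, which is exactly the sort of decision that has to be made deliberately.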

After doing that, you will need (!!!) to adapt the SVM software to work with this new metric/distance instead of the Euclidean metric. This may possibly all have been done already by someone, somewhere.  I dunno.

Of course your proposed technique of replacing NAs by zeroes does define a distance between such points.  But I doubt me an it be meaningful.

OTOH how meaningful is the Euclidean metric between points whose entries are numeric but in completely unrelated units (gpa, age, weight, income, ...) ???
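
A small Python sketch of the units problem, with made-up (gpa, income) records: merely changing income from dollars to thousands of dollars changes which record is "nearest" to the query. The names and numbers are hypothetical.

```python
import math

def euclidean(p, q):
    """Ordinary Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(query, candidates):
    """Name of the candidate closest to the query in Euclidean distance."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))

def income_in_thousands(point):
    """Same record, with income expressed in thousands of dollars."""
    gpa, income = point
    return (gpa, income / 1000.0)

# Hypothetical records: (gpa, income in dollars).
query = (3.0, 50000.0)
candidates = {
    "x": (1.0, 50100.0),   # very different gpa, nearly identical income
    "y": (3.1, 51000.0),   # nearly identical gpa, slightly different income
}

nearest_dollars = nearest(query, candidates)
nearest_thousands = nearest(
    income_in_thousands(query),
    {name: income_in_thousands(p) for name, p in candidates.items()},
)
print(nearest_dollars, nearest_thousands)
```

The data are the same in both runs; only the unit of one column changed, yet the answer flips from x to y.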

I'm sure this is little-to-no help in reality. But I suspect that little-to-no help is possible.

A thought that just occurred to me: there ***might*** be some mileage in trying to "impute" values for the NAs in your data. However, sensible imputation requires (so I believe) pretty stringent conditions --- like multivariate Gaussianity? --- on your data, which are unlikely to be satisfied. (Else why are you using SVM techniques in the first place?) Frank Harrell might have something useful --- or caustic (or both) --- to say on this issue.
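
For orientation only, the crudest form of imputation can be sketched in Python as below: fill each NA with the mean of the observed values and keep a flag recording that the value was missing. This is an illustrative assumption, not a recommended procedure; the function name and the flag column are invented here, and the sketch does nothing to check whether the conditions for sensible imputation hold.

```python
def impute_mean_with_flag(column):
    """Mean-impute a numeric column (None = missing).

    Returns a list of (value, was_missing) pairs, so that "no history"
    remains distinguishable from a genuinely observed value.
    """
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [(mean, 1) if x is None else (x, 0) for x in column]

# Hypothetical past_gpa column: the second student has no history,
# the third studied before and failed everything.
past_gpa = [3.5, None, 0.0, 2.5]
imputed = impute_mean_with_flag(past_gpa)
print(imputed)
```

Unlike zero-filling, this at least keeps the student with no history separate from the one who failed everything, but whether the filled-in mean means anything is a question about the data, not the code.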

        cheers,

                Rolf Turner


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
