To add to Rolf's point, a tool for imputation in R is aregImpute in
Frank Harrell's Hmisc package.
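
For what it's worth, here is a minimal sketch of how aregImpute might be
called. The data frame `students` and its columns are made up purely for
illustration, and I am assuming Hmisc is installed; treat this as a
starting point, not a recommended analysis.

```r
# Sketch only: `students`, `age`, `hours` and `past_gpa` are invented names.
library(Hmisc)

set.seed(1)
n <- 200
students <- data.frame(
  age      = rnorm(n, 20, 2),
  hours    = runif(n, 0, 40),
  past_gpa = pmin(4, pmax(0, rnorm(n, 3, 0.5)))
)
students$past_gpa[sample(n, 30)] <- NA   # make 30 values missing

# Multiple imputation: draw 5 sets of imputed values for past_gpa
imp <- aregImpute(~ age + hours + past_gpa, data = students,
                  n.impute = 5, pr = FALSE)

# Pull out the first of the 5 imputations and build one completed data set
filled <- impute.transcan(imp, imputation = 1, data = students,
                          list.out = TRUE, pr = FALSE, check = FALSE)
completed <- students
completed$past_gpa <- as.numeric(filled$past_gpa)
```

Fitting the SVM to each of the 5 completed data sets, not just the first,
would give some idea of how sensitive the results are to the imputation.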
I am not sure if the discussion of past GPA as the missing variable is
literal or merely illustrative. If literal, is the GPA missing because
it exists but was not reported, or because it does not exist? If the
latter, you may wish to analyze the individuals with no prior GPA
separately, since that seems to be a profound difference.
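
A rough sketch of what analyzing the two groups separately could look
like in base R. The data and the column names (`past_gpa`, `hours`,
`has_history`) are hypothetical.

```r
# Hypothetical data: past_gpa is NA for students with no academic history.
students <- data.frame(
  past_gpa = c(3.2, NA, 2.1, NA, 3.8, 0.0),
  hours    = c(12, 30, 8, 25, 15, 5)
)

# Flag the students who have a prior GPA, then split on that flag
students$has_history <- !is.na(students$past_gpa)
groups <- split(students, students$has_history)

with_history    <- groups[["TRUE"]]    # a model here may use past_gpa
without_history <- groups[["FALSE"]]   # a model here must leave it out
```

Note that a student with a GPA of 0.0 stays in the "has history" group,
which is the distinction a zero fill would lose.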
Regards,
James
On Aug 2, 2009, at 8:22 PM, Rolf Turner <r.tur...@auckland.ac.nz> wrote:
On 3/08/2009, at 11:32 AM, Noah Silverman wrote:
Rolf,
Point taken.
However, some of the variables in the experiment simply don't have
data for some of the examples.
Since I'm training an SVM that will complain about an NA, how do
you suggest I handle this?
Imagine a model predicting student performance/grades/whatever.
One variable might be "past_gpa".
If we have some students with no history, what do you put for that
column? NA is more "correct", but won't work with an SVM.
I'm always happy to learn...
I know next to nothing about support vector machines. Despite my
ignorance I remain suspicious of the concept. I suspect that
fortune("machine learning") is relevant.
If you have a data set that contains intrinsic NAs and you wish to
apply SVM methods to these data, then you will need to understand how
SVMs work and decide what *should* be done to handle these NAs. My
vague understanding is that SVM tries to build pairs of hyperplanes,
as widely separated as possible, between classes of data. This
requires that each datum be representable as a point in n-dimensional
space. A datum one of whose entries is NA is not (really) such a
point. Moreover it sure as hell isn't the same as the point produced
by replacing that NA by 0.
To take your example involving past_gpa --- a student who has no past
gpa is very likely to be very different from a student who has
previously studied and failed everything!
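
To make the point concrete, a tiny sketch with two invented students
(the names `no_history` and `failed_all` and the values are purely
illustrative): after the zero fill the two become the same point.

```r
# Two hypothetical students of the same age
no_history <- c(past_gpa = NA, age = 19)   # never studied before
failed_all <- c(past_gpa = 0,  age = 19)   # studied and failed everything

# Replace the NA by 0, as proposed
zero_filled <- no_history
zero_filled[is.na(zero_filled)] <- 0

same_point <- identical(zero_filled, failed_all)
gap        <- sqrt(sum((zero_filled - failed_all)^2))   # Euclidean distance
```

Here `same_point` is TRUE and `gap` is 0, so anything built on distances
cannot tell these two students apart.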
What you need is a *metric* which tells you the distance between a
point with an NA in it and another point. The other point may have no
NAs amongst its coordinates, or it might have an NA in a *different*
coordinate. I.e. you need to define a distance between points, some of
whose coordinates may be missing, in a *meaningful* way.
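
One candidate, offered only as a sketch and not as *the* meaningful
answer: compute the squared distance over the coordinates observed in
both points and scale it up as if all coordinates had contributed. The
function name `partial_dist` is made up; as far as I know this is also
how base R's dist() treats NAs for the Euclidean case.

```r
# Hypothetical "partial" Euclidean distance for points with NAs.
partial_dist <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)            # coordinates observed in both
  if (!any(ok)) return(NA_real_)         # nothing in common to compare
  # Scale the partial sum of squares up to the full dimension
  sqrt(sum((x[ok] - y[ok])^2) * length(x) / sum(ok))
}

a <- c(3, NA, 4)
b <- c(0, 1, 0)

d_custom <- partial_dist(a, b)            # sqrt((9 + 16) * 3 / 2)
d_base   <- as.numeric(dist(rbind(a, b))) # dist() excludes and rescales too
```

Be warned that this need not satisfy the triangle inequality, so it is
not a metric in the strict sense, and it quietly assumes the missing
coordinate behaves like the observed ones. Whether that is meaningful
for a student with no past gpa is exactly the question at issue.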
After doing that, you will need (!!!) to adapt the SVM software to
work with this new metric/distance instead of the Euclidean metric.
This may possibly all have been done already by someone, somewhere.
I dunno.
Of course your proposed technique of replacing NAs by zeroes does
define a distance between such points. But I doubt me an it be
meaningful.
OTOH how meaningful is the Euclidean metric between points whose
entries are numeric but in completely unrelated units (gpa, age,
weight, income, ...) ???
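
The usual partial remedy is to standardise each column before computing
distances. A small sketch in base R, with invented data, showing how
the raw Euclidean distance is dominated by whichever variable has the
largest numbers:

```r
# Hypothetical students measured in unrelated units
students <- data.frame(
  gpa    = c(3.2, 2.1, 3.8),
  age    = c(19, 23, 21),
  income = c(12000, 45000, 30000)
)

d_raw <- dist(students)     # essentially just differences in income

z        <- scale(students) # centre each column, divide by its sd
d_scaled <- dist(z)         # every variable now contributes comparably
```

I believe svm() in the e1071 package scales its inputs by default
(scale = TRUE), so this may already be happening behind the scenes.
Scaling makes the units comparable; it does not make the resulting
distance any more meaningful in itself.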
I'm sure this is little-to-no help in reality. But I suspect that
little-to-no help is possible.
A thought that just occurred to me: there ***might*** be some mileage
in trying to ``impute'' values for the NAs in your data. However
sensible imputation requires (so I believe) pretty stringent
conditions --- like multivariate Gaussianity? --- on your data, which
are unlikely to be satisfied. (Else why are you using SVM techniques
in the first place?) Frank Harrell might have something useful --- or
caustic (or both) --- to say on this issue.
cheers,
Rolf Turner
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.