Re: [R] Strange column shifting with read.table

James Pirruccello Sun, 02 Aug 2009 21:25:14 -0700

To add to Rolf's point, a tool for imputation in R is aregImpute in Frank Harrell's Hmisc package.

I am not sure if the discussion of past GPA as the missing variable is literal or merely illustrative. If literal, is the GPA missing because it was not reported (i.e., it exists but was not reported), or because it does not exist? If the latter, you may wish to analyze the individuals with no prior GPA separately, since that seems to be a profound difference.


Regards,

James



On Aug 2, 2009, at 8:22 PM, Rolf Turner <r.tur...@auckland.ac.nz> wrote:

On 3/08/2009, at 11:32 AM, Noah Silverman wrote:
Rolf,

Point taken.
However, some of the variables in the experiment simply don't have data for some of the examples.
Since I'm training an SVM that will complain about an NA, how do you suggest I handle this?
Imagine a model predicting student performance/grades/whatever.

One variable might be "past_gpa".
If we have some students with no history, what do you put for that column? NA is more "correct", but won't work with an SVM.
I'm always happy to learn...

I know next to nothing about support vector machines. Despite my ignorance I remain suspicious of the concept. I suspect that fortune("machine learning") is relevant.
If you have a data set that contains intrinsic NAs and you wish to apply SVM methods to these data, then you will need to understand how SVMs work and decide what *should* be done to handle these NAs. My vague understanding is that SVM tries to build pairs of hyperplanes, as widely separated as possible, between classes of data. This requires that each datum be representable as a point in n-dimensional space. A datum one of whose entries is NA is not (really) such a point. Moreover it sure as hell isn't the same as the point produced by replacing that NA by 0.
To take your example involving past_gpa --- a student who has no past gpa is very likely to be very different from a student who has previously studied and failed everything!
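
The point about zero-filling can be made concrete with a small sketch. It is in Python rather than R, and the feature layout (past_gpa, age) and all values are made up purely for illustration; nothing here comes from the original data.

```python
import math

# Hypothetical feature vectors: (past_gpa, age). None marks a missing value.
no_history = (None, 19.0)   # new student: no prior GPA exists
failed_all = (0.0, 19.0)    # previously studied and failed everything

def zero_fill(point):
    """Replace every missing coordinate with 0.0."""
    return tuple(0.0 if x is None else x for x in point)

def euclidean(p, q):
    """Ordinary Euclidean distance between two complete points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# After zero-filling, the two students become the same point.
collapsed_distance = euclidean(zero_fill(no_history), failed_all)
print(collapsed_distance)  # 0.0
```

Any method that works on these coordinates, an SVM included, can no longer tell the two students apart.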
What you need is a *metric* which tells you the distance between a point with an NA in it and another point. The other point may have no NAs amongst its coordinates, or it might have an NA in a *different* coordinate. I.e. you need to define a distance between points, some of whose coordinates may be missing, in a *meaningful* way.
After doing that, you will need (!!!) to adapt the SVM software to work with this new metric/distance instead of the Euclidean metric. This may possibly all have been done already by someone, somewhere.  I dunno.
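
As one possible illustration of a distance that tolerates missing coordinates, here is a minimal Python sketch of a "partial distance": it uses only the coordinates observed in both points and rescales by the fraction of coordinates used. This is an assumption chosen for illustration, not a method proposed in this thread. The function name is hypothetical, and the result is not guaranteed to satisfy the triangle inequality, so it may not be a metric in the strict sense.

```python
import math

def partial_distance(p, q):
    """Euclidean distance over coordinates observed in both points,
    rescaled by (total coordinates / shared coordinates).
    Returns None when the points share no observed coordinate."""
    shared = [(a, b) for a, b in zip(p, q)
              if a is not None and b is not None]
    if not shared:
        return None
    squared = sum((a - b) ** 2 for a, b in shared)
    return math.sqrt(squared * len(p) / len(shared))

# Made-up points; None marks a missing coordinate.
complete = partial_distance((0.0, 0.0), (3.0, 4.0))           # plain Euclidean
one_missing = partial_distance((3.0, None, 4.0), (0.0, 2.0, 0.0))
disjoint = partial_distance((None, 1.0), (2.0, None))         # nothing shared
```

Whether such a distance is *meaningful* for a given data set is a separate question. It treats a missing value as carrying no information, which is questionable when the absence itself is informative, as with a student who has no past gpa.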
Of course your proposed technique of replacing NAs by zeroes does define a distance between such points.  But I doubt me an it be meaningful.
OTOH how meaningful is the Euclidean metric between points whose entries are numeric but in completely unrelated units (gpa, age, weight, income, ...) ???
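
A small Python sketch, using made-up (gpa, income) rows, shows how strongly the raw Euclidean distance depends on the choice of units. Standardizing each column (subtract the mean, divide by the standard deviation) is one common workaround, shown here as an assumption. It removes the dependence on units but does not by itself make the distance meaningful.

```python
import math
from statistics import mean, pstdev

# Hypothetical rows: (gpa, income in dollars).
rows = [(2.0, 30000.0), (4.0, 30500.0), (3.0, 60000.0)]

def euclidean(p, q):
    """Ordinary Euclidean distance between two complete points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(data):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*data))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(r, mus, sds)) for r in data]

# Raw distances: income swamps gpa entirely.
raw_01 = euclidean(rows[0], rows[1])
raw_02 = euclidean(rows[0], rows[2])

# Standardized distances: both variables now contribute.
scaled = standardize(rows)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])
```

On the raw scale the first student looks far closer to the second (a two-point gpa gap, similar income) than to the third. After standardizing, the two distances are nearly equal.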
I'm sure this is little-to-no help in reality. But I suspect that little-to-no help is possible.
A thought that just occurred to me: there ***might*** be some mileage in trying to ``impute'' values for the NAs in your data. However sensible imputation requires (so I believe) pretty stringent conditions --- like multivariate Gaussianity? --- on your data, which are unlikely to be satisfied. (Else why are you using SVM techniques in the first place?) Frank Harrell might have something useful --- or caustic (or both) --- to say on this issue.

   cheers,

       Rolf Turner

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

