Re: [R] Strange column shifting with read.table

Rolf Turner Sun, 02 Aug 2009 17:25:21 -0700


On 3/08/2009, at 11:32 AM, Noah Silverman wrote:

Rolf,

Point taken.
However, some of the variables in the experiment simply don't have data for some of the examples.
Since I'm training an SVM that will complain about an NA, how do you suggest I handle this?
Imagine a model predicting student performance/grades/whatever.

One variable might be "past_gpa".
If we have some students with no history, what do you put for that column? NA is more "correct", but won't work with an SVM.
I'm always happy to learn...

I know next to nothing about support vector machines. Despite my ignorance I remain suspicious of the concept. I suspect that fortune("machine learning") is relevant.

If you have a data set that contains intrinsic NAs and you wish to apply SVM methods to these data, then you will need to understand how SVMs work and decide what *should* be done to handle these NAs. My vague understanding is that SVM tries to build pairs of hyperplanes, as widely separated as possible, between classes of data. This requires that each datum be representable as a point in n-dimensional space. A datum one of whose entries is NA is not (really) such a point. Moreover it sure as hell isn't the same as the point produced by replacing that NA by 0.

To take your example involving past_gpa --- a student who has no past gpa is very likely to be very different from a student who has previously studied and failed everything!
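
A minimal base-R sketch of that point, using made-up data (the `students` table and the `euclid` helper are hypothetical, not from the original thread). It shows that the Euclidean distance to a row containing an NA is undefined, and that zero-filling makes the "no history" student coincide with the "failed everything" student:

```r
# Hypothetical toy data: three students, two features.
students <- data.frame(
  past_gpa = c(3.5, 0.0, NA),   # strong record, failed everything, no history
  age      = c(20, 20, 20),
  row.names = c("strong", "failed", "no_history")
)

# Plain Euclidean distance between two numeric vectors; any NA propagates.
euclid <- function(a, b) sqrt(sum((a - b)^2))

x <- as.matrix(students)

# With the NA left in place the distance is undefined (NA).
d_na <- euclid(x["no_history", ], x["failed", ])

# Zero-filling makes the distance computable, but the student with no
# history now sits exactly on top of the student who failed everything.
x0 <- x
x0[is.na(x0)] <- 0
d_zero <- euclid(x0["no_history", ], x0["failed", ])

print(d_na)
print(d_zero)
```

Whether that coincidence is acceptable depends on the data; the sketch only makes visible what the zero-fill silently assumes.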

What you need is a *metric* which tells you the distance between a point with an NA in it and another point. The other point may have no NAs amongst its coordinates, or it might have an NA in a *different* coordinate. I.e. you need to define a distance between points, some of whose coordinates may be missing, in a *meaningful* way.
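
One candidate, offered only as a sketch and not as a recommendation: a "partial" Euclidean distance that uses the coordinates observed in both points and rescales by the fraction of coordinates used. The function name `partial_dist` is hypothetical. Base R's `dist()` documents a similar rescaling for rows with missing values. Note that a function like this need not satisfy the triangle inequality, so it may not be a metric in the strict sense, and whether it is *meaningful* for a given data set is a separate question.

```r
# Partial Euclidean distance: use only coordinates observed in BOTH points,
# then scale the sum of squares up by (total coords / coords used).
partial_dist <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  if (!any(ok)) return(NA_real_)   # no shared coordinates: still undefined
  sqrt(sum((a[ok] - b[ok])^2) * length(a) / sum(ok))
}

p <- c(1, NA, 3)   # NA in the second coordinate
q <- c(4, 2, NA)   # NA in a *different* coordinate
r <- c(4, 6, 3)    # no NAs

d_pq <- partial_dist(p, q)   # only coordinate 1 is shared
d_pr <- partial_dist(p, r)   # coordinates 1 and 3 are shared
```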

After doing that, you will need (!!!) to adapt the SVM software to work with this new metric/distance instead of the Euclidean metric. This may possibly all have been done already by someone, somewhere. I dunno.

Of course your proposed technique of replacing NAs by zeroes does define a distance between such points. But I doubt me an it be meaningful.

OTOH how meaningful is the Euclidean metric between points whose entries are numeric but in completely unrelated units (gpa, age, weight, income, ...) ???
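
A small base-R illustration of the units problem, with invented numbers. On the raw scale income (in dollars) swamps gpa, so the nearest neighbour of `a` is decided by income alone; after standardising each column with `scale()` the nearest neighbour changes. Standardising is itself an arbitrary choice of units, so this is a sketch of the sensitivity, not a fix for it.

```r
# Hypothetical data: gpa in grade points, income in dollars.
x <- rbind(a = c(gpa = 2.0, income = 30000),
           b = c(gpa = 4.0, income = 30100),
           c = c(gpa = 2.1, income = 30150))

# Raw Euclidean distances: dominated by the income column.
d_raw <- as.matrix(dist(x))

# Distances after centring each column and dividing by its standard deviation.
d_std <- as.matrix(dist(scale(x)))

# Nearest neighbour of "a" under each choice of units.
nearest <- function(d, i) names(which.min(d[i, colnames(d) != i]))
nearest(d_raw, "a")
nearest(d_std, "a")
```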

I'm sure this is little-to-no help in reality. But I suspect that little-to-no help is possible.

A thought that just occurred to me: there ***might*** be some mileage in trying to ``impute'' values for the NAs in your data. However sensible imputation requires (so I believe) pretty stringent conditions --- like multivariate Gaussianity? --- on your data, which are unlikely to be satisfied. (Else why are you using SVM techniques in the first place?) Frank Harrell might have something useful --- or caustic (or both) --- to say on this issue.
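
For concreteness, here is about the crudest form imputation can take in base R: fill each NA with the mean of the observed values and keep an indicator column recording where the NAs were. This is deliberately *not* the model-based imputation alluded to above; the data and the column name `past_gpa_missing` are hypothetical, and the sketch assumes nothing about whether mean-filling is sensible for any real data set.

```r
# Hypothetical toy data.
students <- data.frame(past_gpa = c(3.5, 2.0, NA, 3.0),
                       age      = c(20, 22, 19, 21))

# Record which values were missing before overwriting them, so the
# "no history" students remain distinguishable after the fill.
students$past_gpa_missing <- as.integer(is.na(students$past_gpa))

# Fill each NA with the mean of the observed values.
obs_mean <- mean(students$past_gpa, na.rm = TRUE)
students$past_gpa[is.na(students$past_gpa)] <- obs_mean

print(students)
```

The indicator column at least stops the filled-in students from being indistinguishable from students whose real gpa happens to equal the mean.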

        cheers,

                Rolf Turner


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
