Re: [R] Nominal variables in SVM?

Achim Zeileis Wed, 12 Aug 2009 14:27:59 -0700

On Wed, 12 Aug 2009, Noah Silverman wrote:

Hi,
The answers to my previous question about nominal variables have led me to a more important question.
What is the "best practice" way to feed nominal variables to an SVM?

As some of the previous posters have already indicated: The data structure for storing categorical (including nominal) variables in R is a "factor".

Your comment about "truly nominal" is wrong. A character variable is a character variable, not necessarily a categorical variable. Categorical means that the answer falls into one of a finite number of known categories, known as "levels" in R's "factor" class.


If you start out from character information:

  x <- c("red", "red", "blue", "green", "blue")

You can turn it into a factor via:

  x <- factor(x, levels = c("red", "green", "blue"))

R now knows how to do certain things with such a variable, e.g., it produces useful summaries and knows how to deal with it in regression problems:


  model.matrix(~ x)

which seems to be what you asked for. Moreover, you don't need to call this yourself, because most regression functions in R will do that for you (including svm() in "e1071" or ksvm() in "kernlab", among others).
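To make the encoding concrete, here is a minimal base R sketch, assuming R's default treatment contrasts. The names mm, mm_full, x_new and mm_new are only for illustration. It also shows that a new vector without any "blue" observations still gets the same columns, as long as the levels are reused:

```r
x <- factor(c("red", "red", "blue", "green", "blue"),
            levels = c("red", "green", "blue"))

# Default treatment contrasts: intercept plus one indicator
# per non-reference level ("red" is the reference here)
mm <- model.matrix(~ x)
colnames(mm)       # "(Intercept)" "xgreen" "xblue"

# Dropping the intercept gives one 0/1 indicator column per level
mm_full <- model.matrix(~ x - 1)
colnames(mm_full)  # "xred" "xgreen" "xblue"

# New data lacking "blue": reusing the levels keeps all three columns
x_new <- factor(c("red", "green"), levels = levels(x))
mm_new <- model.matrix(~ x_new - 1)
ncol(mm_new)       # 3
```

The "blue" column of mm_new is simply all zeros, so "red" and "green" are encoded in the same positions as before.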

In short: Keep your categorical variables as "factor" columns in a "data.frame" and use the formula interface of svm()/ksvm() and you are fine.
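A minimal sketch of this workflow, assuming the "e1071" package is installed. The data frame, its column names (color, size, y) and all values are made up for illustration:

```r
library(e1071)

# Hypothetical training data: one factor predictor, one numeric predictor
train <- data.frame(
  color = factor(c("red", "blue", "green", "red", "blue", "green"),
                 levels = c("red", "green", "blue")),
  size  = c(1.0, 2.1, 3.2, 0.9, 2.0, 3.1),
  y     = factor(c("a", "b", "b", "a", "b", "b"))
)

# The formula interface expands the factor into indicator columns internally
fit <- svm(y ~ color + size, data = train)

# Test data without any "blue" cases: reuse the training levels so that
# "red" and "green" are encoded exactly as they were during training
test <- data.frame(
  color = factor(c("red", "green"), levels = levels(train$color)),
  size  = c(1.1, 3.0)
)

pred <- predict(fit, newdata = test)
```

Building the test factor with levels(train$color) is the part that matters: it keeps the factor coding identical between the two data sets even when some levels do not occur in the test set.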

For example:
color = ("red", "blue", "green")

I could translate that into an index so I wind up with
color= (1,2,3)
But my concern is that the SVM will now think that the values are numeric in "range" and not discrete conditions.
Another thought would be to create 3 binary variables from the single color variable, so I have:
red = (0,1)
blue = (0,1)
green = (0,1)
An example fed to the SVM would have one positive and two negative values to indicate the color value:
i.e. for a blue example:
red = 0, blue = 1, green = 0
Or, do any of the SVM packages intelligently handle this internally so that I don't have to mess with it? If so, do I need to be concerned about different "translation" of the data if the test data set isn't exactly the same as the training set?
For example:
training data = color ("red", "blue", "green")
test data = color ("red", "green")
How would I be sure that the "red" and "green" examples get encoded the same so that the SVM is accurate?
Thanks in advance!!

-N

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


