From: [email protected]
To: [email protected]
Subject: RE: Decision tree: categorical variables
Date: Wed, 20 Aug 2014 12:09:52 -0700
Hi Xiangrui,
My data is in the following format:
0,1,5,A,8,1,M
0,1,5,B,4,1,M
1,0,2,B,7,0,U
0,1,3,C,8,0,M
0,0,5,C,1,0,M
1,1,5,C,8,0,U
0,0,5,B,8,0,M
1,0,3,B,2,1,M
0,1,5,B,8,0,F
1,0,2,B,4,0,F
0,1,5,A,8,0,F
I can create a map like this:
val catmap = Map(3 -> 3, 6 -> 2)
However, I am not sure what I should do when I parse the data. In the default
case, I parse it like this:
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
Do I need to explicitly do something for columns 3 and 6, or will just
specifying the map suffice?
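
Going by the reply quoted below, the letters in columns 3 and 6 would
presumably have to become 0-based indices before the toDouble step. Here is a
minimal sketch of that encoding in plain Scala, with ordinary collections
standing in for the RDD so it runs without Spark. The object name, the helper
names, and the alphabetical ordering of categories are all illustrative
assumptions. The sketch also assumes that the map keys refer to positions in
the feature vector (after the label is dropped), which would shift the keys to
2 and 5.

```scala
// Sketch only: plain Scala collections stand in for the RDD (no Spark needed).
// All names here are hypothetical, chosen for illustration.
object CategoricalEncodingSketch {
  // A few rows from the sample above: label first, then six feature columns.
  val rows: Seq[String] = Seq(
    "0,1,5,A,8,1,M",
    "1,0,2,B,7,0,U",
    "0,1,3,C,8,0,M",
    "0,1,5,B,8,0,F"
  )

  // Raw column positions (0-based, label included) that hold string categories.
  val categoricalCols: Seq[Int] = Seq(3, 6)

  val table: Seq[Array[String]] = rows.map(_.split(','))

  // For each categorical column, assign every distinct value an index
  // 0 .. numCategories - 1 (alphabetical order is an arbitrary choice).
  val encoders: Map[Int, Map[String, Int]] = categoricalCols.map { c =>
    c -> table.map(_(c)).distinct.sorted.zipWithIndex.toMap
  }.toMap

  // Convert one split row into (label, features), replacing categories
  // with their indices and parsing everything else as a Double.
  def encode(parts: Array[String]): (Double, Array[Double]) = {
    val values = parts.zipWithIndex.map { case (v, i) =>
      encoders.get(i).map(m => m(v).toDouble).getOrElse(v.toDouble)
    }
    (values.head, values.tail)
  }

  // Assumption: keys index into the feature vector, so raw column c becomes
  // c - 1 once the label is removed; values are the category counts.
  val catmap: Map[Int, Int] =
    encoders.map { case (c, m) => (c - 1) -> m.size }
}
```

Each (label, features) pair could then be wrapped as
LabeledPoint(label, Vectors.dense(features)), as in the parsing snippet above.
Under these assumptions the sample data gives three categories for each of the
two columns (A/B/C and F/M/U).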
> Date: Tue, 19 Aug 2014 16:45:35 -0700
> Subject: Re: Decision tree: categorical variables
> From: [email protected]
> To: [email protected]
> CC: [email protected]
>
> The categorical features must be encoded into indices starting from 0:
> 0, 1, ..., numCategories - 1. Then you can provide the
> categoricalFeatureInfo map to specify which columns contain
> categorical features and the number of categories in each. Joseph is
> updating the user guide. But if you want to try something now, you can
> take a look at the docs of DecisionTree.trainClassifier and
> trainRegressor:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L360
>
> -Xiangrui
>
> On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak <[email protected]> wrote:
> > Hi All,
> >
> > Is there any example of the MLlib decision tree handling categorical
> > variables? My dataset includes a few categorical variables (20 out of 100
> > features), so I was interested in knowing how I can use the current version
> > of the decision tree implementation to handle this situation. I looked at
> > LabeledData and am not sure if that is the way to go.