Hi OBones.

1. Which columns are features? For ml, use `setFeaturesCol` and `setLabelCol` to assign the input columns: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier
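For example, a minimal sketch (not tested; it assumes your DataFrame already has a Vector column, here named "features" for illustration, and uses the "Y" label column from your code):

  import org.apache.spark.ml.regression.DecisionTreeRegressor

  val dt = new DecisionTreeRegressor()
    .setFeaturesCol("features")  // Vector column holding X1, X2, X3
    .setLabelCol("Y")            // target column; note setLabelCol, not setPredictionCol
    .setImpurity("variance")
    .setMaxDepth(30)
    .setMaxBins(32)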
2. Which ones are categorical? For ml, use a Transformer to build the feature Vector; in your case, use VectorIndexer: http://spark.apache.org/docs/latest/ml-features.html#vectorindexer In general, use a Transformer or Estimator to build the Vector, and an Estimator to train and test. A combined sketch follows after the quoted message below.

On Thu, Jun 15, 2017 at 5:59 PM, OBones <obo...@free.fr> wrote:

> Hello,
>
> I have written the following Scala code to train a regression tree, based
> on mllib:
>
>   val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
>   val sc = new SparkContext(conf)
>   val spark = new SparkSession.Builder().getOrCreate()
>
>   val sourceData = spark.read.format("com.databricks.spark.csv")
>     .option("header", "true")
>     .option("delimiter", ";")
>     .load("C:\\Data\\source_file.csv")
>
>   val data = sourceData.select($"X3".cast("double"), $"Y".cast("double"),
>     $"X1".cast("double"), $"X2".cast("double"))
>
>   val featureIndices = List("X1", "X2", "X3").map(data.columns.indexOf(_))
>   val targetIndex = data.columns.indexOf("Y")
>
>   // WARNING: Indices in categoricalFeaturesInfo are those inside the
>   // vector we build from the featureIndices list.
>   // Column 0 has two modalities, column 1 has three.
>   val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
>   val impurity = "variance"
>   val maxDepth = 30
>   val maxBins = 32
>
>   val labeled = data.map(row => LabeledPoint(row.getDouble(targetIndex),
>     Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))
>
>   val model = DecisionTree.trainRegressor(labeled.rdd,
>     categoricalFeaturesInfo, impurity, maxDepth, maxBins)
>
>   println(model.toDebugString)
>
> This works quite well, but I want some information from the model, one of
> them being the feature importance values. As it turns out, this is not
> available on DecisionTreeModel but is available on
> DecisionTreeRegressionModel from the ml package.
> I then discovered that the ml package is more recent than the mllib
> package, which explains why it gives me more control over the trees I'm
> building.
> So I tried to rewrite my sample code using the ml package, and it is very
> much easier to use; there is no need for the LabeledPoint transformation.
> Here is the code I came up with:
>
>   val dt = new DecisionTreeRegressor()
>     .setPredictionCol("Y")
>     .setImpurity("variance")
>     .setMaxDepth(30)
>     .setMaxBins(32)
>
>   val model = dt.fit(data)
>
>   println(model.toDebugString)
>   println(model.featureImportances.toString)
>
> However, I cannot find a way to specify which columns are features, which
> ones are categorical, and how many categories they have, like I used to do
> with the mllib package.
> I did look at the DecisionTreeRegressionExample.scala example found in
> the source package, but it uses a VectorIndexer to automatically discover
> the above information, which is an unnecessary step in my case because I
> already have the information at hand.
>
> The documentation found online (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor)
> did not help either, because it does not indicate the format of the
> featuresCol string property.
>
> Thanks in advance for your help.
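Putting the two points together against your quoted code, a rough sketch (untested; VectorAssembler plus a Pipeline is one way among several to build the Vector, and the maxCategories value is an assumption — it only needs to be at least your largest category count, 3 here, while X3 is assumed to have more distinct values so it stays continuous):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
  import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, DecisionTreeRegressor}

  // Build the feature Vector from the raw columns.
  val assembler = new VectorAssembler()
    .setInputCols(Array("X1", "X2", "X3"))
    .setOutputCol("rawFeatures")

  // Mark vector slots with <= 3 distinct values as categorical;
  // this replaces the manual categoricalFeaturesInfo map from mllib.
  val indexer = new VectorIndexer()
    .setInputCol("rawFeatures")
    .setOutputCol("features")
    .setMaxCategories(3)

  val dt = new DecisionTreeRegressor()
    .setFeaturesCol("features")
    .setLabelCol("Y")
    .setImpurity("variance")
    .setMaxDepth(30)
    .setMaxBins(32)

  val model = new Pipeline().setStages(Array(assembler, indexer, dt)).fit(data)

  // The fitted tree is the last pipeline stage; it exposes featureImportances.
  val tree = model.stages.last.asInstanceOf[DecisionTreeRegressionModel]
  println(tree.toDebugString)
  println(tree.featureImportances)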