Hi, OBones.

1. Which columns are features?
For ml, use `setFeaturesCol` and `setLabelCol` to assign the input columns
(the same setters exist on DecisionTreeRegressor):
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier
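A minimal sketch, reusing the column names (X1, X2, X3, Y) from your own code;
first assemble the raw columns into the single vector column the estimator
expects, then point the estimator at it:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Assemble the feature columns into one vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("features")

val assembled = assembler.transform(data)

val dt = new DecisionTreeRegressor()
  .setFeaturesCol("features")  // name of the vector column, not a format string
  .setLabelCol("Y")            // the target; setPredictionCol only names the output column
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

val model = dt.fit(assembled)
```

Note that in your ml snippet you called `setPredictionCol("Y")`, which names
the prediction *output* column; the target is set with `setLabelCol`.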

2. Which ones are categorical?
For ml, use a Transformer to create the feature Vector. In your case, use
VectorIndexer, which marks low-arity columns as categorical in the vector's
metadata, and the tree reads that metadata automatically:
http://spark.apache.org/docs/latest/ml-features.html#vectorindexer
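A sketch, assuming the features have already been assembled into a vector
column named "features" (e.g. with VectorAssembler):

```scala
import org.apache.spark.ml.feature.VectorIndexer

// Any feature with <= maxCategories distinct values is flagged as categorical
// in the output vector's metadata; the rest stay continuous.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(3)  // covers your 2- and 3-modality columns

val indexed = indexer.fit(assembled).transform(assembled)
```

It does rescan the data to discover the categories, but it replaces the
`categoricalFeaturesInfo` map you built by hand for mllib.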

In general: use a Transformer / Estimator to build the feature Vector, then
use an Estimator to train and test.
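Putting the two steps together for your regression case, a sketch (column
names and hyperparameters taken from your code; the stage column names are
illustrative):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, DecisionTreeRegressor}

val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("rawFeatures")

val indexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(3)

val dt = new DecisionTreeRegressor()
  .setFeaturesCol("features")
  .setLabelCol("Y")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

// Chain the stages so fit() runs assemble -> index -> train in one call.
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))
val pipelineModel = pipeline.fit(data)

// The trained tree is the last stage; it exposes featureImportances.
val model = pipelineModel.stages.last.asInstanceOf[DecisionTreeRegressionModel]
println(model.toDebugString)
println(model.featureImportances)
```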





On Thu, Jun 15, 2017 at 5:59 PM, OBones <obo...@free.fr> wrote:

> Hello,
>
> I have written the following scala code to train a regression tree, based
> on mllib:
>
>     val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
>     val sc = new SparkContext(conf)
>     val spark = new SparkSession.Builder().getOrCreate()
>
>     val sourceData = spark.read.format("com.databricks.spark.csv")
>       .option("header", "true").option("delimiter", ";")
>       .load("C:\\Data\\source_file.csv")
>
>     val data = sourceData.select($"X3".cast("double"),
> $"Y".cast("double"), $"X1".cast("double"), $"X2".cast("double"))
>
>     val featureIndices = List("X1", "X2", "X3").map(data.columns.indexOf(_))
>     val targetIndex = data.columns.indexOf("Y")
>
>     // WARNING: Indices in categoricalFeatures info are those inside the
> vector we build from the featureIndices list
>     // Column 0 has two modalities, Column 1 has three
>     val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
>     val impurity = "variance"
>     val maxDepth = 30
>     val maxBins = 32
>
>     val labeled = data.map(row => LabeledPoint(row.getDouble(targetIndex),
> Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))
>
>     val model = DecisionTree.trainRegressor(labeled.rdd,
> categoricalFeaturesInfo, impurity, maxDepth, maxBins)
>
>     println(model.toDebugString)
>
> This works quite well, but I want some information from the model, one of
> them being the features importance values. As it turns out, this is not
> available on DecisionTreeModel but is available on
> DecisionTreeRegressionModel from the ml package.
> I then discovered that the ml package is more recent than the mllib
> package which explains why it gives me more control over the trees I'm
> building.
> So, I tried to rewrite my sample code using the ml package and it is very
> much easier to use, no need for the LabeledPoint transformation. Here is
> the code I came up with:
>
>     val dt = new DecisionTreeRegressor()
>       .setPredictionCol("Y")
>       .setImpurity("variance")
>       .setMaxDepth(30)
>       .setMaxBins(32)
>
>     val model = dt.fit(data)
>
>     println(model.toDebugString)
>     println(model.featureImportances.toString)
>
> However, I cannot find a way to specify which columns are features, which
> ones are categorical and how many categories they have, like I used to do
> with the mllib package.
> I did look at the DecisionTreeRegressionExample.scala example found in
> the source package, but it uses a VectorIndexer to automatically discover
> the above information which is an unnecessary step in my case because I
> already have the information at hand.
>
> The documentation found online
> (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor)
> did not help either because it does not indicate the format for the
> featuresCol string property.
>
> Thanks in advance for your help.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
