Hi Marco,

Which error did you get? I think it should work fine if the `assembler` is added as the first stage of the pipeline, like:

```
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))
```
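My guess, from the code further down the thread, is that the "features" column error comes from fitting `featureIndexer` eagerly with `.fit(data)` before the assembler has run. If the feature indexer goes into the pipeline unfitted, `Pipeline.fit()` invokes the assembler first, so the "features" column exists by the time the indexer is fit. A minimal sketch of what I mean, reusing `inputData` and the column names from your mail:

```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}

// Assemble all non-label columns into a single "features" vector;
// as the first pipeline stage, it runs before the indexers.
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "Severity"))
  .setOutputCol("features")

// "Severity" exists in the raw data, so the label indexer can safely be
// fit up front; its labels are then available for IndexToString below.
val labelIndexer = new StringIndexer()
  .setInputCol("Severity")
  .setOutputCol("indexedLabel")
  .fit(inputData)

// Deliberately NOT fit here: the pipeline fits it only after the
// assembler has produced the "features" column.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

// Split the raw DataFrame; assembly and indexing happen inside fit().
val Array(trainingData, testData) = inputData.randomSplit(Array(0.8, 0.2))
val model = pipeline.fit(trainingData)
```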
Thanks.

On Tue, Dec 19, 2017 at 6:08 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
> Hello Weichen,
> Sorry to bother you again with my ML issue, but I feel you have more
> experience than I do with this, so perhaps you can tell me whether I am
> following the correct steps, as I keep getting confused by the different
> Decision Tree examples.
>
> As a starting point I have this DataFrame:
>
> [BI-RADS, Age, Shape, Margin, Density, Severity]
>
> The label is 'Severity' and all the other columns are features.
> I am following the steps below, and I was wondering if you could advise
> whether I am doing the right thing. I was unable to add the assembler at
> the beginning of the pipeline, resorting instead to the following code
> (inputData is the original DataFrame):
>
> val assembler = new VectorAssembler()
>   .setInputCols(inputData.columns.filter(_ != "Severity"))
>   .setOutputCol("features")
>
> val data = assembler.transform(inputData)
>
> val labelIndexer = new StringIndexer()
>   .setInputCol("Severity")
>   .setOutputCol("indexedLabel")
>   .fit(data)
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.
>   .fit(data)
>
> val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
>
> // Train a DecisionTree model.
> val dt = new DecisionTreeClassifier()
>   .setLabelCol("indexedLabel")
>   .setFeaturesCol("indexedFeatures")
>
> // Convert indexed labels back to original labels.
> val labelConverter = new IndexToString()
>   .setInputCol("prediction")
>   .setOutputCol("predictedLabel")
>   .setLabels(labelIndexer.labels)
>
> // Chain indexers and tree in a Pipeline.
> val pipeline = new Pipeline()
>   .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
>
> trainingData.cache()
> testData.cache()
>
> // Train model. This also runs the indexers.
> val model = pipeline.fit(trainingData)
>
> // Make predictions.
> val predictions = model.transform(testData)
>
> // Select example rows to display.
> predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)
>
> // Select (prediction, true label) and compute test error.
> val evaluator = new MulticlassClassificationEvaluator()
>   .setLabelCol("indexedLabel")
>   .setPredictionCol("prediction")
>   .setMetricName("accuracy")
> val accuracy = evaluator.evaluate(predictions)
> println("Test Error = " + (1.0 - accuracy))
>
> Could you advise whether this is the proper approach when using an
> assembler? I was unable to add the assembler at the beginning of the
> pipeline: it seems it didn't get invoked, since at the moment of fitting
> the featureIndexer the column 'features' was not found.
>
> This is not urgent; I'd appreciate it if you could give me your comments.
> Kind regards,
> Marco
>
> On Sun, Dec 17, 2017 at 2:48 AM, Weichen Xu <weichen...@databricks.com> wrote:
>
>> Hi Marco,
>>
>> Yes, you can apply `VectorAssembler` first in the pipeline to assemble
>> multiple feature columns.
>>
>> Thanks.
>>
>> On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>>> Hello Weichen,
>>> Thanks, I should have checked the data.
>>> My data has this format:
>>>
>>> |col1|col2|col3|label|
>>>
>>> so it looks like I cannot use VectorIndexer directly (it accepts a
>>> Vector column).
>>> I am guessing that what I should do is something like this (given that
>>> I have a few categorical features):
>>>
>>> val assembler = new VectorAssembler()
>>>   .setInputCols(inputData.columns.filter(_ != "Label"))
>>>   .setOutputCol("features")
>>>
>>> val transformedData = assembler.transform(inputData)
>>>
>>> val featureIndexer = new VectorIndexer()
>>>   .setInputCol("features")
>>>   .setOutputCol("indexedFeatures")
>>>   .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.
>>>   .fit(transformedData)
>>>
>>> ?
>>> Apologies for the basic question, but the last time I worked on an ML
>>> project I was using Spark 1.x.
>>>
>>> kr
>>> marco
>>>
>>> On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
>>>>
>>>> The data now includes a feature column named "features":
>>>>
>>>> val featureIndexer = new VectorIndexer()
>>>>   .setInputCol("features") // <-- here you specify the "features" column to index.
>>>>   .setOutputCol("indexedFeatures")
>>>>
>>>> Thanks.
>>>>
>>>> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> I am trying to run a sample decision tree, following the examples here
>>>>> (for MLlib):
>>>>>
>>>>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>>>>
>>>>> The example seems to use a VectorIndexer; however, I am missing
>>>>> something. How does the featureIndexer know which columns are features?
>>>>> Isn't there something missing, or is the featureIndexer able to figure
>>>>> out by itself which columns of the DataFrame are features?
>>>>>
>>>>> val labelIndexer = new StringIndexer()
>>>>>   .setInputCol("label")
>>>>>   .setOutputCol("indexedLabel")
>>>>>   .fit(data)
>>>>>
>>>>> // Automatically identify categorical features, and index them.
>>>>> val featureIndexer = new VectorIndexer()
>>>>>   .setInputCol("features")
>>>>>   .setOutputCol("indexedFeatures")
>>>>>   .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
>>>>>   .fit(data)
>>>>>
>>>>> Using this code I get back this exception:
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>>>>     at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>     at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>     at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>>>>     at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>>>>     at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>>>>     at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>>>>     at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>>>>     at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>>>>     at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>>>>
>>>>> What am I missing?
>>>>>
>>>>> w/kindest regards,
>>>>> Marco