Hi Marco,

Which error did you get? I think it should work fine if the `assembler` is added as the first stage of the pipeline, like:

```
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))
```
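My guess, from the code further down the thread, is that the "features" column error comes from fitting `featureIndexer` eagerly with `.fit(data)` before the assembler has run. If the feature indexer goes into the pipeline unfitted, `Pipeline.fit()` invokes the assembler first, so the "features" column exists by the time the indexer is fit. A minimal sketch of what I mean, reusing `inputData` and the column names from your mail:

```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}

// Assemble all non-label columns into a single "features" vector;
// as the first pipeline stage, it runs before the indexers.
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "Severity"))
  .setOutputCol("features")

// "Severity" exists in the raw data, so the label indexer can safely be
// fit up front; its labels are then available for IndexToString below.
val labelIndexer = new StringIndexer()
  .setInputCol("Severity")
  .setOutputCol("indexedLabel")
  .fit(inputData)

// Deliberately NOT fit here: the pipeline fits it only after the
// assembler has produced the "features" column.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

// Split the raw DataFrame; assembly and indexing happen inside fit().
val Array(trainingData, testData) = inputData.randomSplit(Array(0.8, 0.2))
val model = pipeline.fit(trainingData)
```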
Thanks.

On Tue, Dec 19, 2017 at 6:08 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
> Hello Weichen,
> Sorry to bother you again with my ML issue, but I feel you have more
> experience than I do with this, so perhaps you can tell me whether I am
> following the correct steps, as I keep getting confused by the different
> Decision Tree examples.
>
> As a starting point I have this DataFrame:
>
> [BI-RADS, Age, Shape, Margin, Density, Severity]
>
> The label is 'Severity' and all the other columns are features.
> I am following the steps below, and I was wondering if you could advise
> whether I am doing the right thing. I was unable to add the assembler at
> the beginning of the pipeline, resorting instead to the following code
> (inputData is the original DataFrame):
>
> val assembler = new VectorAssembler()
>   .setInputCols(inputData.columns.filter(_ != "Severity"))
>   .setOutputCol("features")
>
> val data = assembler.transform(inputData)
>
> val labelIndexer = new StringIndexer()
>   .setInputCol("Severity")
>   .setOutputCol("indexedLabel")
>   .fit(data)
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.
>   .fit(data)
>
> val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
>
> // Train a DecisionTree model.
> val dt = new DecisionTreeClassifier()
>   .setLabelCol("indexedLabel")
>   .setFeaturesCol("indexedFeatures")
>
> // Convert indexed labels back to original labels.
> val labelConverter = new IndexToString()
>   .setInputCol("prediction")
>   .setOutputCol("predictedLabel")
>   .setLabels(labelIndexer.labels)
>
> // Chain indexers and tree in a Pipeline.
> val pipeline = new Pipeline()
>   .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
>
> trainingData.cache()
> testData.cache()
>
> // Train model. This also runs the indexers.
> val model = pipeline.fit(trainingData)
>
> // Make predictions.
> val predictions = model.transform(testData)
>
> // Select example rows to display.
> predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)
>
> // Select (prediction, true label) and compute test error.
> val evaluator = new MulticlassClassificationEvaluator()
>   .setLabelCol("indexedLabel")
>   .setPredictionCol("prediction")
>   .setMetricName("accuracy")
> val accuracy = evaluator.evaluate(predictions)
> println("Test Error = " + (1.0 - accuracy))
>
> Could you advise whether this is the proper approach when using an
> assembler? I was unable to add the assembler at the beginning of the
> pipeline: it seems it didn't get invoked, since at the moment of fitting
> the featureIndexer the column 'features' was not found.
>
> This is not urgent; I'd appreciate it if you could give me your comments.
> Kind regards,
> Marco
>
> On Sun, Dec 17, 2017 at 2:48 AM, Weichen Xu <weichen...@databricks.com> wrote:
>
>> Hi Marco,
>>
>> Yes, you can apply `VectorAssembler` first in the pipeline to assemble
>> multiple feature columns.
>>
>> Thanks.
>>
>> On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>
>>> Hello Weichen,
>>> Thanks, I should have checked the data.
>>> My data has this format:
>>>
>>> |col1|col2|col3|label|
>>>
>>> so it looks like I cannot use VectorIndexer directly (it accepts a
>>> Vector column).
>>> I am guessing that what I should do is something like this (given that
>>> I have a few categorical features):
>>>
>>> val assembler = new VectorAssembler()
>>>   .setInputCols(inputData.columns.filter(_ != "Label"))
>>>   .setOutputCol("features")
>>>
>>> val transformedData = assembler.transform(inputData)
>>>
>>> val featureIndexer = new VectorIndexer()
>>>   .setInputCol("features")
>>>   .setOutputCol("indexedFeatures")
>>>   .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.
>>>   .fit(transformedData)
>>>
>>> ?
>>> Apologies for the basic question, but the last time I worked on an ML
>>> project I was using Spark 1.x.
>>>
>>> kr
>>> marco
>>>
>>> On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
>>>>
>>>> The data now includes a feature column named "features":
>>>>
>>>> val featureIndexer = new VectorIndexer()
>>>>   .setInputCol("features") // <-- here you specify the "features" column to index.
>>>>   .setOutputCol("indexedFeatures")
>>>>
>>>> Thanks.
>>>>
>>>> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> I am trying to run a sample decision tree, following the examples here
>>>>> (for MLlib):
>>>>>
>>>>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>>>>
>>>>> The example seems to use a VectorIndexer; however, I am missing
>>>>> something. How does the featureIndexer know which columns are features?
>>>>> Isn't there something missing, or is the featureIndexer able to figure
>>>>> out by itself which columns of the DataFrame are features?
>>>>>
>>>>> val labelIndexer = new StringIndexer()
>>>>>   .setInputCol("label")
>>>>>   .setOutputCol("indexedLabel")
>>>>>   .fit(data)
>>>>>
>>>>> // Automatically identify categorical features, and index them.
>>>>> val featureIndexer = new VectorIndexer()
>>>>>   .setInputCol("features")
>>>>>   .setOutputCol("indexedFeatures")
>>>>>   .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
>>>>>   .fit(data)
>>>>>
>>>>> Using this code I get back this exception:
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>>>>     at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>     at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>>>     at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>>>>     at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>>>>     at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>>>>     at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>>>>     at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>>>>     at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>>>>     at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>>>>
>>>>> What am I missing?
>>>>>
>>>>> w/kindest regards,
>>>>> Marco