Hi Marco,

Yes, you can apply `VectorAssembler` first in the pipeline to assemble multiple feature columns into a single vector column.
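For instance, a minimal sketch of such a pipeline (the column names `col1`..`col3` are taken from your schema, and it assumes they are already numeric):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// Assemble all non-label columns into a single "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")

// VectorIndexer then consumes the assembler's output column.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5) // features with > 4 distinct values are treated as continuous

// Putting the assembler before the indexer in the pipeline guarantees the
// "features" column exists by the time the indexer is fitted.
val pipeline = new Pipeline().setStages(Array(assembler, featureIndexer))
val model = pipeline.fit(inputData) // inputData: your DataFrame with col1..col3 and label
```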
Thanks.

On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hello Wei
> Thanks, I should have checked the data. My data has this format:
>
> |col1|col2|col3|label|
>
> so it looks like I cannot use VectorIndexer directly (it accepts a Vector
> column). I am guessing what I should do is something like this (given I
> have a few categorical features):
>
>     val assembler = new VectorAssembler()
>       .setInputCols(inputData.columns.filter(_ != "label"))
>       .setOutputCol("features")
>
>     val transformedData = assembler.transform(inputData)
>
>     val featureIndexer = new VectorIndexer()
>       .setInputCol("features")
>       .setOutputCol("indexedFeatures")
>       .setMaxCategories(5) // features with > 4 distinct values are
>                            // treated as continuous
>       .fit(transformedData)
>
> ?
> Apologies for the basic question, but the last time I worked on an ML
> project I was using Spark 1.x.
>
> kr
> marco
>
> On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:
>
>> Hi, Marco,
>>
>>     val data = spark.read.format("libsvm")
>>       .load("data/mllib/sample_libsvm_data.txt")
>>
>> The data now includes a feature column with the name "features":
>>
>>     val featureIndexer = new VectorIndexer()
>>       .setInputCol("features") // <-- here, specify the "features" column to index
>>       .setOutputCol("indexedFeatures")
>>
>> Thanks.
>>
>> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>> I am trying to run a sample decision tree, following the examples here
>>> (for MLlib):
>>>
>>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>>
>>> The example seems to use a VectorIndexer; however, I am missing
>>> something. How does the featureIndexer know which columns are features?
>>> Isn't there something missing? Or is the featureIndexer able to figure
>>> out by itself which columns of the DataFrame are features?
>>>
>>>     val labelIndexer = new StringIndexer()
>>>       .setInputCol("label")
>>>       .setOutputCol("indexedLabel")
>>>       .fit(data)
>>>
>>>     // Automatically identify categorical features, and index them.
>>>     val featureIndexer = new VectorIndexer()
>>>       .setInputCol("features")
>>>       .setOutputCol("indexedFeatures")
>>>       .setMaxCategories(4) // features with > 4 distinct values are treated as continuous
>>>       .fit(data)
>>>
>>> Using this code I am getting back this exception:
>>>
>>>     Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>>       at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>       at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>>       at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>>       at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>>       at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>>       at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>>       at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>>       at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>>       at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>>
>>> What am I missing?
>>>
>>> w/kindest regards
>>>
>>> marco