Yes, I am working on this. Sorry for the delay, but I will try to submit a PR ASAP. Thanks!
On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> For now, you must follow this approach of constructing a pipeline
> consisting of a StringIndexer for each categorical column. See
> https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to
> allow multiple columns for StringIndexer, which is being worked on
> currently.
>
> The reason you're seeing an NPE is:
>
> var indexers: Array[StringIndexer] = null
>
> and then you're trying to append an element to something that is null.
>
> Try this instead:
>
> var indexers: Array[StringIndexer] = Array()
>
> But even better is a more functional approach:
>
> val indexers = featureCol.map { colName =>
>   new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
> }
>
> On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <rezaul.ka...@insight-centre.org> wrote:
>> Hi All,
>>
>> There are several categorical columns in my dataset as follows:
>> [image: grafik.png]
>>
>> How can I transform the values in each categorical column into numeric
>> values using StringIndexer, so that the resulting DataFrame can be fed into
>> VectorAssembler to generate a feature vector?
>>
>> A naive approach is to apply a StringIndexer to each categorical
>> column separately, but that sounds tedious, I know.
>> A possible workaround
>> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
>> in PySpark is to combine several StringIndexers in a list and use a Pipeline
>> to execute them all, as follows:
>>
>> from pyspark.ml import Pipeline
>> from pyspark.ml.feature import StringIndexer
>>
>> indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
>>             for column in list(set(df.columns) - set(['date']))]
>> pipeline = Pipeline(stages=indexers)
>> df_r = pipeline.fit(df).transform(df)
>> df_r.show()
>>
>> How can I do the same in Scala? I tried the following:
>>
>> val featureCol = trainingDF.columns
>> var indexers: Array[StringIndexer] = null
>>
>> for (colName <- featureCol) {
>>   val index = new StringIndexer()
>>     .setInputCol(colName)
>>     .setOutputCol(colName + "_indexed")
>>     //.fit(trainDF)
>>   indexers = indexers :+ index
>> }
>>
>> val pipeline = new Pipeline()
>>   .setStages(indexers)
>> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
>> newDF.show()
>>
>> However, I am getting a NullPointerException at
>>
>> for (colName <- featureCol)
>>
>> I am sure I am doing something wrong. Any suggestions?
>>
>> Regards,
>> _________________________________
>> *Md. Rezaul Karim*, BSc, MSc
>> Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> <http://139.59.184.114/index.html>
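For reference, a minimal self-contained Scala sketch of the functional approach Nick describes above, combining the per-column StringIndexers with a VectorAssembler in one Pipeline. The DataFrame name trainingDF, the assumption that every column except "date" is categorical, and the "features" output column are illustrative assumptions, not taken from the thread.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Assumed: trainingDF already exists and every column except "date" is a
// categorical feature; adjust this filter for your own schema.
val categoricalCols = trainingDF.columns.filter(_ != "date")

// One StringIndexer per categorical column (the functional approach above).
val indexers = categoricalCols.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

// Assemble all indexed columns into a single feature vector column.
val assembler = new VectorAssembler()
  .setInputCols(categoricalCols.map(_ + "_indexed"))
  .setOutputCol("features")

// Run everything as one Pipeline; fit() learns the string-to-index mappings.
val stages: Array[PipelineStage] = indexers ++ Array(assembler)
val pipeline = new Pipeline().setStages(stages)
val indexedDF = pipeline.fit(trainingDF).transform(trainingDF)
indexedDF.show()

Once SPARK-11215 lands, the per-column loop could presumably be replaced by a single multi-column StringIndexer, but until then the per-column stages above are the way to go.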