Hi Nick,

Both approaches worked and I realized my silly mistake too. Thank you so much.
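For reference, here is roughly what the working version looks like end to end, with a VectorAssembler added after the indexers. This is only a minimal sketch: the toy DataFrame, the column names, and the spark session (as predefined in spark-shell) are illustrative assumptions, not my actual data.

// A minimal sketch, assuming a SparkSession named `spark` (e.g. spark-shell)
// and illustrative column names; not the real dataset from the screenshot.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Toy DataFrame standing in for the real training data
val trainingDF = spark.createDataFrame(Seq(
  ("red",  "small", 0.0),
  ("blue", "large", 1.0),
  ("red",  "large", 1.0)
)).toDF("color", "size", "label")

val featureCols = Array("color", "size")

// One StringIndexer per categorical column, built functionally (no mutable var)
val indexers = featureCols.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

// Assemble the indexed columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(featureCols.map(_ + "_indexed"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = indexers :+ assembler
val newDF = new Pipeline().setStages(stages).fit(trainingDF).transform(trainingDF)
newDF.show()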
@Xu, thanks for the update.

Best Regards,
_________________________________
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 30 October 2017 at 10:40, Weichen Xu <weichen...@databricks.com> wrote:

> Yes, I am working on this. Sorry for the delay, but I will try to submit a
> PR ASAP. Thanks!
>
> On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> For now, you must follow this approach of constructing a pipeline
>> consisting of a StringIndexer for each categorical column. See
>> https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA
>> to allow multiple columns for StringIndexer, which is being worked on
>> currently.
>>
>> The reason you're seeing an NPE is:
>>
>> var indexers: Array[StringIndexer] = null
>>
>> and then you're trying to append an element to something that is null.
>>
>> Try this instead:
>>
>> var indexers: Array[StringIndexer] = Array()
>>
>> But even better is a more functional approach:
>>
>> val indexers = featureCol.map { colName =>
>>   new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
>> }
>>
>> On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>>> Hi All,
>>>
>>> There are several categorical columns in my dataset as follows:
>>> [image: grafik.png]
>>>
>>> How can I transform the values in each (categorical) column into numeric
>>> values using StringIndexer, so that the resulting DataFrame can be fed into
>>> VectorAssembler to generate a feature vector?
>>>
>>> A naive approach would be to use a StringIndexer for each categorical
>>> column, but that sounds tedious, I know.
>>> A possible workaround
>>> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
>>> in PySpark is to combine several StringIndexers in a list and use a
>>> Pipeline to execute them all, as follows:
>>>
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import StringIndexer
>>>
>>> indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
>>>             for column in list(set(df.columns)-set(['date']))]
>>> pipeline = Pipeline(stages=indexers)
>>> df_r = pipeline.fit(df).transform(df)
>>> df_r.show()
>>>
>>> How can I do the same in Scala? I tried the following:
>>>
>>> val featureCol = trainingDF.columns
>>> var indexers: Array[StringIndexer] = null
>>>
>>> for (colName <- featureCol) {
>>>   val index = new StringIndexer()
>>>     .setInputCol(colName)
>>>     .setOutputCol(colName + "_indexed")
>>>     //.fit(trainDF)
>>>   indexers = indexers :+ index
>>> }
>>>
>>> val pipeline = new Pipeline()
>>>   .setStages(indexers)
>>> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
>>> newDF.show()
>>>
>>> However, I am experiencing a NullPointerException at
>>>
>>> for (colName <- featureCol)
>>>
>>> I am sure I am doing something wrong. Any suggestions?
>>>
>>> Regards,
>>> _________________________________
>>> *Md. Rezaul Karim*, BSc, MSc
>>> Researcher, INSIGHT Centre for Data Analytics
>>> National University of Ireland, Galway
>>> IDA Business Park, Dangan, Galway, Ireland
>>> Web: http://www.reza-analytics.eu/index.html
>>> <http://139.59.184.114/index.html>