Hi All,

There are several categorical columns in my dataset as follows:
[image: Inline images 1]

How can I transform the values in each categorical column into numeric
indices using StringIndexer, so that the resulting DataFrame can be fed into
VectorAssembler to generate a feature vector?

A naive approach would be to apply a separate StringIndexer to each
categorical column by hand, but that sounds clumsy, I know.
A possible workaround in PySpark
<https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
is to collect several StringIndexers in a list and use a Pipeline to
execute them all, as follows:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
                for column in list(set(df.columns)-set(['date']))]
    pipeline = Pipeline(stages=indexers)
    df_r = pipeline.fit(df).transform(df)
    df_r.show()

How can I do the same in Scala? I tried the following:

    val featureCol = trainingDF.columns
    var indexers: Array[StringIndexer] = null

    for (colName <- featureCol) {
      val index = new StringIndexer()
        .setInputCol(colName)
        .setOutputCol(colName + "_indexed")
        //.fit(trainDF)
      indexers = indexers :+ index
    }

    val pipeline = new Pipeline()
      .setStages(indexers)
    val newDF = pipeline.fit(trainingDF).transform(trainingDF)
    newDF.show()

However, I am getting a NullPointerException at

for (colName <- featureCol)

I am sure I am doing something wrong. Any suggestions?
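
For reference, the exception most likely comes from the loop body rather
than the loop header: indexers starts out as null, so indexers :+ index
is called on a null array. Initialising it with Array.empty[StringIndexer]
would probably be the smallest change. Below is a minimal sketch of an
alternative that builds the stages with map and avoids the mutable variable
altogether. The toy DataFrame, its column names ("color", "size"), the
"features" output column, and the local SparkSession are assumptions made up
for illustration; they stand in for the real trainingDF.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object IndexAllColumns {
  def main(args: Array[String]): Unit = {
    // Local session for the sketch only (assumption).
    val spark = SparkSession.builder()
      .appName("IndexAllColumns")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for trainingDF with two categorical columns.
    val trainingDF = Seq(
      ("red", "small"),
      ("blue", "large"),
      ("red", "large")
    ).toDF("color", "size")

    val featureCol = trainingDF.columns

    // One StringIndexer per column, built with map instead of appending
    // to a null array inside a for loop.
    val indexers: Array[PipelineStage] = featureCol.map { colName =>
      val indexer: PipelineStage = new StringIndexer()
        .setInputCol(colName)
        .setOutputCol(colName + "_indexed")
      indexer
    }

    // Combine the indexed columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(featureCol.map(_ + "_indexed"))
      .setOutputCol("features")

    // The Pipeline fits every indexer, then runs the assembler.
    val pipeline = new Pipeline().setStages(indexers :+ assembler)
    val newDF = pipeline.fit(trainingDF).transform(trainingDF)
    newDF.show()

    spark.stop()
  }
}
```

If only some columns are categorical, featureCol would need to be filtered
first, much as the PySpark version above excludes 'date'.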



Regards,
_________________________________
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html