Yes, I am working on this. Sorry for the delay, but I will try to submit a PR ASAP. Thanks!
On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> For now, you must follow this approach of constructing a pipeline
> consisting of a StringIndexer for each categorical column. See
> https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to
> allow multiple columns for StringIndexer, which is being worked on
> currently.
>
> The reason you're seeing an NPE is:
>
> var indexers: Array[StringIndexer] = null
>
> and then you're trying to append an element to something that is null.
>
> Try this instead:
>
> var indexers: Array[StringIndexer] = Array()
>
> But even better is a more functional approach:
>
> val indexers = featureCol.map { colName =>
>   new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
> }
>
> On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <rezaul.ka...@insight-centre.org> wrote:
>> Hi All,
>>
>> There are several categorical columns in my dataset as follows:
>> [image: grafik.png]
>>
>> How can I transform the values in each categorical column into numeric
>> values using StringIndexer, so that the resulting DataFrame can be fed into
>> VectorAssembler to generate a feature vector?
>>
>> A naive approach is to apply a StringIndexer to each categorical
>> column separately, but that sounds tedious, I know.
>> A possible workaround
>> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
>> in PySpark is to combine several StringIndexers in a list and use a Pipeline
>> to execute them all, as follows:
>>
>> from pyspark.ml import Pipeline
>> from pyspark.ml.feature import StringIndexer
>>
>> indexers = [StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
>>             for column in list(set(df.columns) - set(['date']))]
>> pipeline = Pipeline(stages=indexers)
>> df_r = pipeline.fit(df).transform(df)
>> df_r.show()
>>
>> How can I do the same in Scala? I tried the following:
>>
>> val featureCol = trainingDF.columns
>> var indexers: Array[StringIndexer] = null
>>
>> for (colName <- featureCol) {
>>   val index = new StringIndexer()
>>     .setInputCol(colName)
>>     .setOutputCol(colName + "_indexed")
>>     //.fit(trainDF)
>>   indexers = indexers :+ index
>> }
>>
>> val pipeline = new Pipeline()
>>   .setStages(indexers)
>> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
>> newDF.show()
>>
>> However, I am getting a NullPointerException at
>>
>> for (colName <- featureCol)
>>
>> I am sure I am doing something wrong. Any suggestions?
>>
>> Regards,
>> _________________________________
>> *Md. Rezaul Karim*, BSc, MSc
>> Researcher, INSIGHT Centre for Data Analytics
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html
>> <http://139.59.184.114/index.html>
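For reference, a minimal self-contained Scala sketch of the functional approach Nick describes above, combining the per-column StringIndexers with a VectorAssembler in one Pipeline. The DataFrame name trainingDF, the assumption that every column except "date" is categorical, and the "features" output column are illustrative assumptions, not taken from the thread.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Assumed: trainingDF already exists and every column except "date" is a
// categorical feature; adjust this filter for your own schema.
val categoricalCols = trainingDF.columns.filter(_ != "date")

// One StringIndexer per categorical column (the functional approach above).
val indexers = categoricalCols.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

// Assemble all indexed columns into a single feature vector column.
val assembler = new VectorAssembler()
  .setInputCols(categoricalCols.map(_ + "_indexed"))
  .setOutputCol("features")

// Run everything as one Pipeline; fit() learns the string-to-index mappings.
val stages: Array[PipelineStage] = indexers ++ Array(assembler)
val pipeline = new Pipeline().setStages(stages)
val indexedDF = pipeline.fit(trainingDF).transform(trainingDF)
indexedDF.show()

Once SPARK-11215 lands, the per-column loop could presumably be replaced by a single multi-column StringIndexer, but until then the per-column stages above are the way to go.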