Hi Nick,

Both approaches worked and I realized my silly mistake too. Thank you so much.
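For reference, here is roughly what the working version looks like end to end, with a VectorAssembler added after the indexers. This is only a minimal sketch: the toy DataFrame, the column names, and the spark session (as predefined in spark-shell) are illustrative assumptions, not my actual data.

// A minimal sketch, assuming a SparkSession named `spark` (e.g. spark-shell)
// and illustrative column names; not the real dataset from the screenshot.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Toy DataFrame standing in for the real training data
val trainingDF = spark.createDataFrame(Seq(
  ("red",  "small", 0.0),
  ("blue", "large", 1.0),
  ("red",  "large", 1.0)
)).toDF("color", "size", "label")

val featureCols = Array("color", "size")

// One StringIndexer per categorical column, built functionally (no mutable var)
val indexers = featureCols.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
}

// Assemble the indexed columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(featureCols.map(_ + "_indexed"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = indexers :+ assembler
val newDF = new Pipeline().setStages(stages).fit(trainingDF).transform(trainingDF)
newDF.show()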
@Xu, thanks for the update.

Best Regards,
_________________________________
*Md. Rezaul Karim*, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html
<http://139.59.184.114/index.html>

On 30 October 2017 at 10:40, Weichen Xu <weichen...@databricks.com> wrote:

> Yes, I am working on this. Sorry for the delay, but I will try to submit a
> PR ASAP. Thanks!
>
> On Mon, Oct 30, 2017 at 5:19 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> For now, you must follow this approach of constructing a pipeline
>> consisting of a StringIndexer for each categorical column. See
>> https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA
>> to allow multiple columns for StringIndexer, which is being worked on
>> currently.
>>
>> The reason you're seeing an NPE is:
>>
>> var indexers: Array[StringIndexer] = null
>>
>> and then you're trying to append an element to something that is null.
>>
>> Try this instead:
>>
>> var indexers: Array[StringIndexer] = Array()
>>
>> But even better is a more functional approach:
>>
>> val indexers = featureCol.map { colName =>
>>   new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
>> }
>>
>> On Fri, 27 Oct 2017 at 22:29 Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>>> Hi All,
>>>
>>> There are several categorical columns in my dataset as follows:
>>> [image: grafik.png]
>>>
>>> How can I transform the values in each (categorical) column into numeric
>>> values using StringIndexer, so that the resulting DataFrame can be fed into
>>> VectorAssembler to generate a feature vector?
>>>
>>> A naive approach would be to use a StringIndexer for each categorical
>>> column, but that sounds tedious, I know.
>>> A possible workaround
>>> <https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe>
>>> in PySpark is to combine several StringIndexers in a list and use a
>>> Pipeline to execute them all, as follows:
>>>
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import StringIndexer
>>>
>>> indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df)
>>>             for column in list(set(df.columns)-set(['date']))]
>>> pipeline = Pipeline(stages=indexers)
>>> df_r = pipeline.fit(df).transform(df)
>>> df_r.show()
>>>
>>> How can I do the same in Scala? I tried the following:
>>>
>>> val featureCol = trainingDF.columns
>>> var indexers: Array[StringIndexer] = null
>>>
>>> for (colName <- featureCol) {
>>>   val index = new StringIndexer()
>>>     .setInputCol(colName)
>>>     .setOutputCol(colName + "_indexed")
>>>     //.fit(trainDF)
>>>   indexers = indexers :+ index
>>> }
>>>
>>> val pipeline = new Pipeline()
>>>   .setStages(indexers)
>>> val newDF = pipeline.fit(trainingDF).transform(trainingDF)
>>> newDF.show()
>>>
>>> However, I am experiencing a NullPointerException at
>>>
>>> for (colName <- featureCol)
>>>
>>> I am sure I am doing something wrong. Any suggestions?
>>>
>>> Regards,
>>> _________________________________
>>> *Md. Rezaul Karim*, BSc, MSc
>>> Researcher, INSIGHT Centre for Data Analytics
>>> National University of Ireland, Galway
>>> IDA Business Park, Dangan, Galway, Ireland
>>> Web: http://www.reza-analytics.eu/index.html
>>> <http://139.59.184.114/index.html>