Re: Running glm in sparkR (data pre-processing step)

Yanbo Liang Mon, 30 May 2016 07:18:07 -0700

Yes, you are right.
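
To make the point concrete, here is a minimal sketch in plain Python (not SparkR) of what casting a double column to string and then one-hot encoding it amounts to. The helper names `cast_to_string` and `one_hot_encode` and the sample rows are hypothetical, chosen only for illustration. Spark's own encoding may order the levels differently and may drop one level as the reference category, which this sketch does not do.

```python
# Illustrative only: plain Python, not SparkR. All names here are hypothetical.

def cast_to_string(rows, column):
    """Return copies of the rows with `column` converted to str (categorical)."""
    return [{**row, column: str(row[column])} for row in rows]


def one_hot_encode(rows, column):
    """Replace a string column with one 0/1 indicator column per distinct level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        # Add one indicator per level; exactly one of them is 1 for each row.
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return levels, encoded


# Sample data: C0 is numeric, but we want the model to treat it as categorical.
rows = [
    {"C0": 1.0, "y": 10.5},
    {"C0": 2.0, "y": 11.0},
    {"C0": 1.0, "y": 9.8},
]

levels, encoded = one_hot_encode(cast_to_string(rows, "C0"), "C0")
print(levels)
for row in encoded:
    print(row)
```

In SparkR itself neither step needs to be written by hand: only the cast is your responsibility, and glm takes care of the encoding.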

2016-05-30 2:34 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>:


>
> Thanks Yanbo.
>
> So you mean that if I have a variable of type double that I want to treat
> as a string in my model, I just have to cast that column to string and
> run the glm model? String columns will be one-hot encoded directly by
> the glm provided by SparkR?
>
> Just wanted to clarify, since in R we need to apply as.factor to
> categorical variables.
>
> val dfNew = df.withColumn("C0", df.col("C0").cast("string"))
>
>
> Abhi !!
>
> On Mon, May 30, 2016 at 2:58 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Abhi,
>>
>> In SparkR glm, categorical features (columns of type string) are one-hot
>> encoded automatically.
>> So pre-processing like `as.factor` is not necessary; you can feed your
>> data directly to model training.
>>
>> Thanks
>> Yanbo
>>
>> 2016-05-30 2:06 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>:
>>
>>> Hi,
>>>
>>> I want to run the SparkR variant of glm on data that is stored in a CSV
>>> file.
>>>
>>> I see that the glm function in SparkR takes a Spark DataFrame as input.
>>>
>>> Now, when I read a CSV file and create a Spark DataFrame, how should I
>>> handle the factor variables/columns in my data?
>>>
>>> Do I need to convert it to an R data frame, convert the columns to
>>> factors using as.factor, create a Spark DataFrame again, and run glm
>>> over it?
>>>
>>> But running as.factor over a big dataset is not possible.
>>>
>>> What pre-processing should be done, and what is the best way to achieve
>>> it?
>>>
>>>
>>> Thanks,
>>> Abhi
>>>
>>
>>
>
