Yes, you are right.

2016-05-30 2:34 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>:
>
> Thanks Yanbo.
>
> So, you mean that if I have a variable which is of type double but I want
> to treat it like a String in my model, I just have to cast those columns
> to string and simply run the glm model? String columns will be directly
> one-hot encoded by the glm provided by SparkR?
>
> Just wanted to clarify, as in R we need to apply as.factor for categorical
> variables.
>
> val dfNew = df.withColumn("C0",df.col("C0").cast("String"))
>
>
> Abhi !!
>
> On Mon, May 30, 2016 at 2:58 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Abhi,
>>
>> In SparkR glm, categorical features (columns of type string) will be
>> one-hot encoded automatically.
>> So pre-processing like `as.factor` is not necessary; you can directly
>> feed your data to the model training.
>>
>> Thanks
>> Yanbo
>>
>> 2016-05-30 2:06 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>:
>>
>>> Hi,
>>>
>>> I want to run the glm variant of SparkR on my data, which is in a csv
>>> file.
>>>
>>> I see that the glm function in SparkR takes a Spark dataframe as input.
>>>
>>> Now, when I read a file from csv and create a Spark dataframe, how could
>>> I take care of the factor variables/columns in my data?
>>>
>>> Do I need to convert it to an R dataframe, convert to factor using
>>> as.factor, create a Spark dataframe, and run glm over it?
>>>
>>> But running as.factor over a big dataset is not possible.
>>>
>>> Please suggest what is the best way to achieve this?
>>>
>>> What pre-processing should be done, and what is the best way to achieve
>>> it?
>>>
>>>
>>> Thanks,
>>> Abhi
>>>
>>
>>
>
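
For anyone who wants to see the cast-then-encode step from the quoted thread outside of SparkR, here is a minimal Scala sketch. It is not taken from the thread: it assumes Spark 2.x (`SparkSession`), uses a small in-memory DataFrame as a hypothetical stand-in for the CSV data, and the column names `y` and `x1` and the helper `castToString` are made up for illustration. It uses spark.ml's `RFormula`, which, as far as I know, is what SparkR's glm builds on to turn string columns into one-hot encoded features.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.{DataFrame, SparkSession}

object CastThenEncode {
  // Hypothetical helper: cast one column to string so that a
  // formula-based encoder treats it as categorical, not numeric.
  def castToString(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, df.col(column).cast("string"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cast-then-encode")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumed stand-in for the CSV data: y is the response, C0 holds
    // numeric category codes, x1 is a continuous feature.
    val df = Seq(
      (1.0, 10.0, 0.5),
      (0.0, 20.0, 1.5),
      (1.0, 30.0, 2.5),
      (0.0, 10.0, 3.5)
    ).toDF("y", "C0", "x1")

    // Same cast as in the thread; C0 becomes a string column.
    val dfNew = castToString(df, "C0")
    dfNew.printSchema()

    // RFormula indexes and one-hot encodes string terms, then assembles
    // everything into a "features" vector next to a "label" column.
    val encoded = new RFormula()
      .setFormula("y ~ C0 + x1")
      .fit(dfNew)
      .transform(dfNew)

    encoded.select("features", "label").show(false)
    spark.stop()
  }
}
```

Comparing the `features` column with and without the cast shows the difference: without it, C0 enters the model as a single numeric term.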