Hi, I want to run the SparkR version of glm on data stored in a CSV file.
I see that the glm function in SparkR takes a Spark DataFrame as input. When I read a CSV file and create a Spark DataFrame from it, how should I handle the factor (categorical) columns in my data? Do I need to load the data into an R data frame, convert those columns with as.factor, create a Spark DataFrame from the result, and then run glm on it? Running as.factor on a big dataset is not feasible.

What pre-processing should be done, and what is the best way to achieve it?

Thanks, Abhi