Hi,
If we want to create dummy variables from categorical columns for data
manipulation purposes, how would we do it in SparkR?
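
One possibility, offered as a rough sketch rather than a confirmed answer: build
each indicator column with SparkR's column-wise ifelse() and withColumn(). This
assumes a local Spark installation that sparkR.init() can find and the
sqlContext-style API used elsewhere in this thread; the loop variable and the
new column names are only for illustration.

```r
library(SparkR)

# Assumption: SPARK_HOME is set so that sparkR.init() can start a local Spark.
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)

# Distribute the built-in iris data frame.
df <- createDataFrame(sqlContext, iris)

# Add one 0/1 indicator column per level of Species.
for (lvl in levels(iris$Species)) {
  df <- withColumn(df, lvl, ifelse(df$Species == lvl, 1, 0))
}

head(df)
```

The levels are taken from the local R data frame here; for data that exists
only in Spark, the distinct values would have to be collected first.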

On Wednesday, January 27, 2016, Deborah Siegel <deborah.sie...@gmail.com>
wrote:

> When fitting the currently available SparkR models, such as glm for
> linear and logistic regression, columns that contain strings are one-hot
> encoded behind the scenes, as part of the parsing of the RFormula. Does
> that help, or did you have something else in mind?
>
>
>
>
>> Thank you so much for your mail. It is working.
>>       I have another small question about SparkR: can we create dummy
>> variables for categorical columns (like the "dummies" package does in R)?
>> For example, the iris dataset has Species as a categorical column, so 3
>> dummy variable columns such as setosa and virginica would be created with
>> 0 and 1 as values.
>
>
> On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote:
>
>> Maybe not ideal, but since read.df infers every CSV column that contains
>> "NA" as a string column, one could filter those rows out rather than using
>> dropna().
>>
>> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
>> head(filtered_aq)
>>
>> Perhaps it would be better to have an option for read.df to convert any
>> "NA" it encounters into null types, like createDataFrame does for <NA>, and
>> then one would be able to use dropna() etc.
>>
>>
>>
>> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Yes, you are right.
>>>
>>> I think the problem is with the reading of CSV files: read.df does not
>>> recognize the NAs in the CSV file.
>>>
>>> So what would be a workable solution for dealing with NAs in CSV files?
>>>
>>>
>>>
>>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote:
>>>
>>>> Hi Devesh,
>>>>
>>>> I'm not certain why that's happening, and it looks like it doesn't
>>>> happen if you use createDataFrame directly:
>>>> aq <- createDataFrame(sqlContext,airquality)
>>>> head(dropna(aq,how="any"))
>>>>
>>>> If I had to guess: dropna(), I believe, drops null values. I suppose
>>>> it's possible that createDataFrame converts R's <NA> values to null, so
>>>> dropna() works with that. But perhaps read.df() does not convert the NAs
>>>> to null, as they are most likely interpreted as strings when they come in
>>>> from the CSV. Just a guess; can anyone confirm?
>>>>
>>>> Deb
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <raj.deves...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have applied the following code to the airquality dataset available in
>>>>> R, which has some missing values. I want to omit the rows which have NAs.
>>>>>
>>>>> library(SparkR)
>>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>>
>>>>> sc <- sparkR.init("local",sparkHome =
>>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>>
>>>>> sqlContext <- sparkRSQL.init(sc)
>>>>>
>>>>> path<-"/Users/devesh/work/airquality/"
>>>>>
>>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>>> header="true", inferSchema="true")
>>>>>
>>>>> head(dropna(aq,how="any"))
>>>>>
>>>>> I am getting the output as
>>>>>
>>>>>   Ozone Solar_R Wind Temp Month Day
>>>>> 1    41     190  7.4   67     5   1
>>>>> 2    36     118  8.0   72     5   2
>>>>> 3    12     149 12.6   74     5   3
>>>>> 4    18     313 11.5   62     5   4
>>>>> 5    NA      NA 14.3   56     5   5
>>>>> 6    28      NA 14.9   66     5   6
>>>>>
>>>>> The NAs still exist in the output. Am I missing something here?
>>>>>
>>>>> --
>>>>> Warm regards,
>>>>> Devesh.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>

-- 
Warm regards,
Devesh.
