Re: NA value handling in sparkR

Devesh Raj Singh Tue, 26 Jan 2016 17:07:05 -0800

Hi,
If we want to create dummy variables from categorical columns for data
manipulation purposes, how would we do it in SparkR?


On Wednesday, January 27, 2016, Deborah Siegel <deborah.sie...@gmail.com>
wrote:

> While fitting the currently available SparkR models, such as glm for
> linear and logistic regression, columns that contain strings are one-hot
> encoded behind the scenes, as part of the parsing of the RFormula. Does
> that help, or did you have something else in mind?
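
For explicit indicator columns outside of model fitting, one possible
approach is to compare the categorical column against each of its levels
and cast the boolean result to an integer. This is only a sketch, not a
built-in SparkR feature: it assumes the sqlContext created earlier in this
thread and the SparkR 1.5-era API (createDataFrame, withColumn, cast), and
the variable names are illustrative.

```r
library(SparkR)

# Assumption: sqlContext already exists, created with sparkRSQL.init(sc)
# as in the original message of this thread.
df <- createDataFrame(sqlContext, iris)  # "." in column names becomes "_"

# Collect the distinct category values to the driver (cheap for few levels).
species_levels <- collect(distinct(select(df, "Species")))$Species

# Add one 0/1 column per level; a boolean Column cast to integer gives 0 or 1.
for (lv in species_levels) {
  df <- withColumn(df, as.character(lv), cast(df$Species == lv, "integer"))
}

head(df)
```

Collecting the levels first keeps the loop simple, but it only makes sense
when the column has a small number of distinct values.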
>
>
>
>
>> Thank you so much for your mail. It is working.
>>       I have another small question about SparkR: can we create dummy
>> variables for categorical columns (like the "dummies" package in R)?
>> E.g. in the iris dataset, Species is a categorical column, so 3 dummy
>> variable columns such as setosa and virginica would be created with 0
>> and 1 as values.
>
>
> On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel <deborah.sie...@gmail.com>
> wrote:
>
>> Maybe not ideal, but since read.df infers every column of the csv that
>> contains "NA" as a string column, one could filter those rows out rather
>> than using dropna().
>>
>> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
>> head(filtered_aq)
>>
>> Perhaps it would be better to have an option for read.df to convert any
>> "NA" it encounters into null types, like createDataFrame does for <NA>, and
>> then one would be able to use dropna() etc.
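
Until such an option exists, one way to approximate it is to cast the
string columns to a numeric type, since Spark's default (non-ANSI) cast
turns values it cannot parse, such as "NA", into null. This is a hedged
sketch, not a confirmed fix: it assumes aq is the DataFrame read with
read.df in this thread, that Ozone and Solar_R arrived as strings, and the
*_int column names are made up for illustration.

```r
library(SparkR)

# Assumption: aq was loaded with read.df(...) as shown in this thread.
# Casting a string such as "NA" to integer yields null under Spark's
# default cast behaviour, so the new columns carry real nulls.
aq_num <- withColumn(aq, "Ozone_int", cast(aq$Ozone, "integer"))
aq_num <- withColumn(aq_num, "Solar_R_int", cast(aq_num$Solar_R, "integer"))

# dropna() can now see the missing values in the casted columns.
aq_clean <- dropna(aq_num, how = "any", cols = c("Ozone_int", "Solar_R_int"))
head(aq_clean)
```

New column names are used rather than overwriting Ozone and Solar_R, to
avoid relying on how withColumn treats an existing name in this release.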
>>
>>
>>
>> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Yes you are right.
>>>
>>> I think the problem is in how csv files are read: read.df does not
>>> treat the NAs in the CSV file as missing values.
>>>
>>> So what would be a workable solution for dealing with NAs in csv files?
>>>
>>>
>>>
>>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel
>>> <deborah.sie...@gmail.com> wrote:
>>>
>>>> Hi Devesh,
>>>>
>>>> I'm not certain why that's happening, and it looks like it doesn't
>>>> happen if you use createDataFrame directly:
>>>> aq <- createDataFrame(sqlContext,airquality)
>>>> head(dropna(aq,how="any"))
>>>>
>>>> If I had to guess: dropna(), I believe, drops null values. I suppose
>>>> it's possible that createDataFrame converts R's <NA> values to null, so
>>>> dropna() works with that. But perhaps read.df() does not convert NAs to
>>>> null, as those are most likely interpreted as strings when they come in
>>>> from the csv. Just a guess, can anyone confirm?
>>>>
>>>> Deb
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh
>>>> <raj.deves...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have applied the following code to the airquality dataset available
>>>>> in R, which has some missing values. I want to omit the rows that have NAs.
>>>>>
>>>>> library(SparkR)
>>>>>
>>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>>
>>>>> sc <- sparkR.init("local",sparkHome =
>>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>>
>>>>> sqlContext <- sparkRSQL.init(sc)
>>>>>
>>>>> path<-"/Users/devesh/work/airquality/"
>>>>>
>>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>>> header="true", inferSchema="true")
>>>>>
>>>>> head(dropna(aq,how="any"))
>>>>>
>>>>> I am getting the output as
>>>>>
>>>>>   Ozone Solar_R Wind Temp Month Day
>>>>> 1    41     190  7.4   67     5   1
>>>>> 2    36     118  8.0   72     5   2
>>>>> 3    12     149 12.6   74     5   3
>>>>> 4    18     313 11.5   62     5   4
>>>>> 5    NA      NA 14.3   56     5   5
>>>>> 6    28      NA 14.9   66     5   6
>>>>>
>>>>> The NAs still exist in the output. Am I missing something here?
>>>>>
>>>>> --
>>>>> Warm regards,
>>>>> Devesh.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>

-- 
Warm regards,
Devesh.
