Re: NA value handling in sparkR

Felix Cheung Wed, 27 Jan 2016 08:16:04 -0800

That's correct. spark-csv, as a Spark package, is not specifically aware of 
R's notion of NA, so it interprets it as a string value.
On the other hand, a native R NA is converted to NULL in Spark when creating a 
Spark DataFrame from an R data.frame. 
https://eradiating.wordpress.com/2016/01/04/whats-new-in-sparkr-1-6-0/
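
The two behaviours can be compared locally in base R, without a Spark 
cluster. This is only a sketch: the file contents and the variable names 
(csv_path, as_r, as_strings) are made up for illustration, and the 
"everything as strings" read merely imitates a reader that is unaware of NA; 
it is not spark-csv itself.

```r
# Write a tiny CSV that contains the literal text NA in some fields.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Ozone,Solar_R", "41,190", "NA,NA", "28,NA"), csv_path)

# Default read.csv(): the text NA is parsed as R's missing value.
as_r <- read.csv(csv_path)

# Reading every column as character and disabling NA detection keeps the
# text "NA" as an ordinary string, similar to what the thread describes.
as_strings <- read.csv(csv_path, colClasses = "character",
                       na.strings = character(0))

print(is.na(as_r$Ozone))        # the second element is a real NA
print(as_strings$Ozone == "NA") # the second element is the string "NA"
```

For data small enough to fit in local memory, a data.frame read this way 
with the default settings could then be passed to 
createDataFrame(sqlContext, as_r), which is the path where NA becomes NULL.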




    _____________________________
From: Devesh Raj Singh <raj.deves...@gmail.com>
Sent: Wednesday, January 27, 2016 3:19 AM
Subject: Re: NA value handling in sparkR
To: Deborah Siegel <deborah.sie...@gmail.com>
Cc:  <user@spark.apache.org>


Hi,

While dealing with missing values in R and SparkR I observed the following. 
Please tell me whether I am right or wrong.

Missing values in native R are represented by the logical constant NA. SparkR 
DataFrames represent missing values with NULL. If you use createDataFrame() to 
turn a local R data.frame into a distributed SparkR DataFrame, SparkR will 
automatically convert NA to NULL.

However, if you create a SparkR DataFrame by reading data from a file with 
read.df(), you may get the string "NA" rather than R's logical constant NA. 
The string "NA" is not automatically converted to NULL.

On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel <deborah.sie...@gmail.com> 
wrote:

Maybe not ideal, but since read.df infers every csv column that contains "NA" 
as a string type, one could filter those rows out rather than using dropna():

filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
head(filtered_aq)

Perhaps it would be better to have an option for read.df to convert any "NA" 
it encounters into null, like createDataFrame does for <NA>, and then one 
would be able to use dropna() etc.
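
Another possible workaround, sketched here rather than tested against this 
setup: cast the affected string columns to a numeric type. It assumes that 
Spark SQL's default cast turns a string that cannot be parsed as a number, 
such as "NA", into null. cast_na_columns is a hypothetical helper name, not 
part of SparkR, and the column names are the ones from the filter above.

```r
# Hypothetical helper (not part of SparkR): casts the two string columns
# from this thread to integer, so that the string "NA" should become a
# Spark null. Assumes SparkR is installed and `df` is a SparkR DataFrame
# such as `aq`.
cast_na_columns <- function(df) {
  df$Ozone   <- SparkR::cast(df$Ozone, "integer")
  df$Solar_R <- SparkR::cast(df$Solar_R, "integer")
  df
}

# Usage, with the `aq` DataFrame created by read.df() in this thread:
#   aq_typed <- cast_na_columns(aq)
#   head(dropna(aq_typed, how = "any"))
```

Unlike the string filter, this also leaves the columns with a numeric type, 
which is usually what later aggregation needs.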
 
                      
                                            
On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com> 
wrote:

Hi,

Yes, you are right. I think the problem is with the reading of csv files: 
read.df does not treat the NAs in the CSV file as missing values.

So what would be a workable solution for dealing with NAs in csv files?
                                  
                                                                   
On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com> 
wrote:

Hi Devesh,

I'm not certain why that's happening, and it looks like it doesn't happen if 
you use createDataFrame directly:

aq <- createDataFrame(sqlContext, airquality)
head(dropna(aq, how = "any"))

If I had to guess: dropna(), I believe, drops null values. I suppose it's 
possible that createDataFrame converts R's <NA> values to null, so dropna() 
works with that. But perhaps read.df() does not convert R <NA>s to null, as 
those are most likely interpreted as strings when they come in from the csv. 
Just a guess, can anyone confirm?

Deb
                                   
                                                    
                                                    
                                                    
                                                    
                                                                                
                                  
On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <raj.deves...@gmail.com> 
wrote:

Hi,

I have applied the following code to the airquality dataset available in R, 
which has some missing values. I want to omit the rows that have NAs.

library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS' =
  '"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')

sc <- sparkR.init("local",
                  sparkHome = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")

sqlContext <- sparkRSQL.init(sc)

path <- "/Users/devesh/work/airquality/"

aq <- read.df(sqlContext, path, source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")

head(dropna(aq, how = "any"))

I am getting the output as

  Ozone Solar_R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

The NAs still exist in the output. Am I missing something here?
                                                      
-- 
Warm regards,
Devesh.
