My data now has a very large number of columns, about 5k to 20k. If I want to
calculate the covariance matrix, which method is best, or most commonly used?

 

 ------------------ Original message ------------------
 From: "Felix Cheung";<[email protected]>;
 Sent: Tuesday, December 29, 2015 12:45
 To: "Andy Davidson"<[email protected]>; 
"zhangjp"<[email protected]>; "Yanbo Liang"<[email protected]>; 
 Cc: "user"<[email protected]>; 
 Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then 
calculate covariance

 

  Make sure you add the spark-csv package, as in the example linked here, so 
that the source parameter in R's read.df works:
 

 
https://spark.apache.org/docs/latest/sparkr.html#from-data-sources
 



 _____________________________
From: Andy Davidson <[email protected]>
Sent: Monday, December 28, 2015 10:24 AM
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then 
calculate covariance
To: zhangjp <[email protected]>, Yanbo Liang <[email protected]>
Cc: user <[email protected]>


 Hi Yanbo 
 

 I use spark-csv to load my data set. I work with both Java and Python. I would 
recommend printing the first couple of rows and the schema to make sure your 
data is loaded as you expect. You might find the following code example 
helpful. You may need to set the schema programmatically, depending on what 
your data looks like.
 

 

  
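import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadTidyDataFrame {

    // Loads a CSV file that has a header row into a DataFrame via spark-csv.
    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true") // infer column types from the data
                .option("header", "true")      // first line holds the column names
                .load(file);

        return df;
    }
}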

 

 

 

 From: Yanbo Liang < [email protected]> 
Date: Monday, December 28, 2015 at 2:30 AM 
To: zhangjp < [email protected]> 
Cc: "user @spark" < [email protected]> 
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then 
calculate covariance 

 

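  Load the CSV file (inferSchema is needed so that numeric columns are not 
loaded as strings):
  df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", 
header = "true", inferSchema = "true") 
 Calculate the covariance of two columns: 
  covariance <- cov(df, "col1", "col2") 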
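
 For reference, the two-column cov call above returns the sample covariance 
(the sum of products of deviations from the means, divided by n - 1). A 
minimal plain-Java sketch of that formula, with no Spark dependency, might 
look like the following. The class and method names are hypothetical and 
chosen only for illustration; this is not how Spark implements it internally.

```java
/** Illustrative helper (hypothetical name): sample covariance of two columns. */
final class SampleCovariance {

    /** Returns sum((x_i - meanX) * (y_i - meanY)) / (n - 1). */
    static double covariance(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        int n = x.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two rows");
        }

        // First pass: column means.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Second pass: sum of products of deviations from the means.
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (n - 1);
    }

    public static void main(String[] args) {
        double[] col1 = {1.0, 2.0, 3.0};
        double[] col2 = {2.0, 4.0, 6.0};
        System.out.println(covariance(col1, col2));
    }
}
```

 Checking a small hand-computed case like this against the value Spark 
returns is a quick way to confirm the columns were loaded as numbers.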
 

 Cheers 
 Yanbo 
 


 
 2015-12-28 17:21 GMT+08:00 zhangjp <[email protected]>: 
  Hi all, 
     I want to use SparkR or Spark MLlib to load a CSV file on HDFS and then 
 calculate the covariance. How can I do this?  
     Thanks.
