Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

Andy Davidson Mon, 28 Dec 2015 10:24:24 -0800

Hi Yanbo

I use spark-csv to load my data sets, and I work with both Java and Python.
I recommend printing the first couple of rows and the schema to make sure
your data loaded as you expect. You might find the following code example
helpful. Depending on what your data looks like, you may need to set the
schema programmatically.



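import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadTidyDataFrame {

    // Loads a CSV file (local or HDFS path) using the spark-csv package.
    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true") // guess column types from the data
                .option("header", "true")      // first line holds the column names
                .load(file);

        // To check the load, call df.printSchema() and df.show(2) on the result.
        return df;
    }
}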
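
For reference, the covariance asked about in this thread is, as far as I
know, the sample covariance of two numeric columns. Below is a minimal
plain-Java sketch of that formula, independent of Spark, which can be handy
for checking a result on a few rows by hand. The class name SampleCovariance
and the method cov are hypothetical names used only for illustration; this
is not how Spark implements it.

```java
final class SampleCovariance {

    // Sample covariance: sum((x_i - mean(x)) * (y_i - mean(y))) / (n - 1).
    static double cov(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException(
                    "need two equal-length arrays with at least 2 values");
        }
        int n = x.length;

        // First pass: column means.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Second pass: sum of products of deviations from the means.
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (n - 1);
    }

    public static void main(String[] args) {
        double[] col1 = {1.0, 2.0, 3.0};
        double[] col2 = {2.0, 4.0, 6.0};
        System.out.println(cov(col1, col2)); // prints 2.0
    }
}
```

With the DataFrame returned by fromCSV, the built-in equivalent should be
df.stat().cov("col1", "col2"), where col1 and col2 are placeholder column
names. Both columns need to be numeric, which is one more reason to check
the schema after loading.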




From:  Yanbo Liang <[email protected]>
Date:  Monday, December 28, 2015 at 2:30 AM
To:  zhangjp <[email protected]>
Cc:  "user @spark" <[email protected]>
Subject:  Re: how to use sparkR or spark MLlib load csv file on hdfs then
calculate covariance

> Load the CSV file:
> df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv",
>               header = "true", inferSchema = "true")
> Calculate covariance:
> cov <- cov(df, "col1", "col2")
> 
> Cheers
> Yanbo
> 
> 
> 2015-12-28 17:21 GMT+08:00 zhangjp <[email protected]>:
>> Hi all,
>>     I want to use SparkR or Spark MLlib to load a CSV file on HDFS and
>> then calculate the covariance. How do I do it?
>>     Thanks.
>
