You can find some more info about SparkR at https://spark.apache.org/docs/latest/sparkr.html
Looking at your sample app, with the provided content, you should be able to run it on SparkR with something like:

# load SparkR with support for csv
sparkR --packages com.databricks:spark-csv_2.10:1.0.3

sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.10:1.0.3")
sqlContext <- sparkRSQL.init(sc)

# get matrix from a file
file <- "file:///...../matrix.csv"

# read it into a variable
raw_data <- read.csv(file, sep=',', header=FALSE)

# convert to a local data frame
localDF <- data.frame(raw_data)

# create the Spark DataFrame
rdd <- createDataFrame(sqlContext, localDF)

printSchema(rdd)
head(rdd)
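Note that, as Rui points out below, an existing R clustering function that expects a plain data.frame (such as the galileo() call in the original message) will not operate directly on the distributed DataFrame. If the data is small enough to fit on the driver, a rough, untested sketch of getting a local data.frame back for it would be:

# collect() brings the distributed DataFrame back to the driver as a plain R data.frame
# (local_matrix is just an illustrative variable name)
local_matrix <- collect(rdd)

# base R coercion and the existing clustering code should then work as before, e.g.:
# result <- galileo(local_matrix, model='hclust', dist='euclidean', link='ward', K=5)

To scale the clustering itself beyond driver memory, the algorithm would have to be re-implemented on top of the SparkR DataFrame API, as Rui notes.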
I was also trying to read the csv directly in R:

df <- read.df(sqlContext, file, "com.databricks.spark.csv", header="false", sep=",")

That worked, but then I was getting exceptions when I tried:

printSchema(df)
head(df)

15/09/17 18:33:30 ERROR CsvRelation$: Exception while parsing line: 7,8,9.
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
        at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
        at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:49)
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:247)
        at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:82)
        at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:61)
        at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:150)
        at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:130)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I will investigate this further and create a JIRA if necessary.

On Wed, Sep 16, 2015 at 11:22 PM, Sun, Rui <rui....@intel.com> wrote:
> The existing algorithms operating on R data.frame can't simply operate on
> SparkR DataFrame. They have to be re-implemented to be based on the SparkR
> DataFrame API.
>
> -----Original Message-----
> From: ekraffmiller [mailto:ellen.kraffmil...@gmail.com]
> Sent: Thursday, September 17, 2015 3:30 AM
> To: user@spark.apache.org
> Subject: SparkR - calling as.vector() with rdd dataframe causes error
>
> Hi,
> I have a library of clustering algorithms that I'm trying to run in the
> SparkR interactive shell. (I am working on a proof of concept for a
> document classification tool.) Each algorithm takes a term-document matrix
> in the form of a data frame. When I pass the method a local data frame,
> the clustering algorithm works correctly, but when I pass it the Spark
> DataFrame (rdd), it gives an error trying to coerce the data into a vector.
> Here is the code that I'm calling within SparkR:
>
> # get matrix from a file
> file <-
> "/Applications/spark-1.5.0-bin-hadoop2.6/examples/src/main/resources/matrix.csv"
>
> # read it into a variable
> raw_data <- read.csv(file, sep=',', header=FALSE)
>
> # convert to a local data frame
> localDF <- data.frame(raw_data)
>
> # create the rdd
> rdd <- createDataFrame(sqlContext, localDF)
>
> # call the algorithm with localDF - this works
> result <- galileo(localDF, model='hclust', dist='euclidean', link='ward', K=5)
>
> # call with the rdd - this produces an error
> result <- galileo(rdd, model='hclust', dist='euclidean', link='ward', K=5)
>
> Error in as.vector(data) :
>   no method for coercing this S4 class to a vector
>
> I get the same error if I call as.vector(rdd) directly.
>
> Is there a reason why this works for localDF and not rdd? Should I be
> doing something else to coerce the object into a vector?
>
> Thanks,
> Ellen

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/