You can find some more info about SparkR at https://spark.apache.org/docs/latest/sparkr.html
Looking at your sample app, with the provided content, you should be able to run it on SparkR with something like:

# load SparkR with support for csv
sparkR --packages com.databricks:spark-csv_2.10:1.0.3

sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.10:1.0.3")
sqlContext <- sparkRSQL.init(sc)

# get matrix from a file
file <- "file:///...../matrix.csv"

# read it into a variable
raw_data <- read.csv(file, sep=',', header=FALSE)

# convert to a local data frame
localDF <- data.frame(raw_data)

# create the Spark DataFrame
rdd <- createDataFrame(sqlContext, localDF)

printSchema(rdd)
head(rdd)
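Note that, as Rui points out below, an existing R clustering function that expects a plain data.frame (such as the galileo() call in the original message) will not operate directly on the distributed DataFrame. If the data is small enough to fit on the driver, a rough, untested sketch of getting a local data.frame back for it would be:

# collect() brings the distributed DataFrame back to the driver as a plain R data.frame
# (local_matrix is just an illustrative variable name)
local_matrix <- collect(rdd)

# base R coercion and the existing clustering code should then work as before, e.g.:
# result <- galileo(local_matrix, model='hclust', dist='euclidean', link='ward', K=5)

To scale the clustering itself beyond driver memory, the algorithm would have to be re-implemented on top of the SparkR DataFrame API, as Rui notes.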
I was also trying to read the csv directly in R:

df <- read.df(sqlContext, file, "com.databricks.spark.csv", header="false", sep=",")

That worked, but then I was getting exceptions when I tried:

printSchema(df)
head(df)

15/09/17 18:33:30 ERROR CsvRelation$: Exception while parsing line: 7,8,9.
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
        at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
        at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:49)
        at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:247)
        at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:82)
        at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:61)
        at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:150)
        at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:130)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1843)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I will investigate this further and create a JIRA if necessary.

On Wed, Sep 16, 2015 at 11:22 PM, Sun, Rui <rui....@intel.com> wrote:
> The existing algorithms operating on R data.frame can't simply operate on
> SparkR DataFrame. They have to be re-implemented to be based on the SparkR
> DataFrame API.
>
> -----Original Message-----
> From: ekraffmiller [mailto:ellen.kraffmil...@gmail.com]
> Sent: Thursday, September 17, 2015 3:30 AM
> To: user@spark.apache.org
> Subject: SparkR - calling as.vector() with rdd dataframe causes error
>
> Hi,
> I have a library of clustering algorithms that I'm trying to run in the
> SparkR interactive shell. (I am working on a proof of concept for a
> document classification tool.) Each algorithm takes a term-document matrix
> in the form of a data frame. When I pass the method a local data frame,
> the clustering algorithm works correctly, but when I pass it the Spark
> DataFrame (rdd), it gives an error trying to coerce the data into a vector.
> Here is the code that I'm calling within SparkR:
>
> # get matrix from a file
> file <-
> "/Applications/spark-1.5.0-bin-hadoop2.6/examples/src/main/resources/matrix.csv"
>
> # read it into a variable
> raw_data <- read.csv(file, sep=',', header=FALSE)
>
> # convert to a local data frame
> localDF <- data.frame(raw_data)
>
> # create the rdd
> rdd <- createDataFrame(sqlContext, localDF)
>
> # call the algorithm with localDF - this works
> result <- galileo(localDF, model='hclust', dist='euclidean', link='ward', K=5)
>
> # call with the rdd - this produces an error
> result <- galileo(rdd, model='hclust', dist='euclidean', link='ward', K=5)
>
> Error in as.vector(data) :
>   no method for coercing this S4 class to a vector
>
> I get the same error if I call as.vector(rdd) directly.
>
> Is there a reason why this works for localDF and not rdd? Should I be
> doing something else to coerce the object into a vector?
>
> Thanks,
> Ellen

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/