Sean,
I do agree about the "inside out" parallelization, but my curiosity is
mostly about what kind of performance I can expect when piping out to R.
I'm playing with Twitter's new Anomaly Detection library, btw; this could
be a solution if I can get the calls to R to stand up to the massive data
volumes.
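For reference, this is roughly the shape of what I've been testing: one
long-lived Rscript process per partition, so R's startup cost is paid once
per partition instead of once per element. score_anomalies.R is just a
placeholder for a script wrapping the Twitter library's detection call; it
has to answer line-for-line on stdout and flush after each line, or this
will block.

import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import org.apache.spark.rdd.RDD

def scoreWithR(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { lines =>
    // One R process per partition; the script must exist on every worker.
    val proc  = new ProcessBuilder("Rscript", "score_anomalies.R").start()
    val toR   = new PrintWriter(proc.getOutputStream)
    val fromR = new BufferedReader(new InputStreamReader(proc.getInputStream))
    val scored = lines.map { line =>
      toR.println(line)
      toR.flush()
      fromR.readLine()
    }.toList // force evaluation before shutting the process down
    toR.close()
    proc.waitFor()
    scored.iterator
  }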
This "inside out" parallelization has been a way people have used R
with MapReduce for a long time. Run N copies of an R script on the
cluster, on different subsets of the data, babysat by Mappers. You
just need R installed on the cluster. Hadoop Streaming makes this easy
and things like RDD.pipe i
You could call R with JRI, or through rdd.foreachPartition(pass_data_to_R),
or rdd.pipe.
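Roughly, with rdd.pipe (the paths and script name here are placeholders;
ship the script to the workers with spark-submit --files, or have it
preinstalled on every node):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipe-to-r"))

// One series (or chunk of a series) per line of text.
val series = sc.textFile("hdfs:///data/metrics")

// pipe() launches the command once per partition and streams the
// partition's elements through its stdin/stdout, one element per line.
val scored = series.pipe("./score_anomalies.R")

scored.saveAsTextFile("hdfs:///data/anomaly-scores")

The R side only has to read lines from stdin and print one line per result.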
From: cjno...@gmail.com
Date: Wed, 1 Apr 2015 19:31:48 -0400
Subject: Re: Streaming anomaly detection using ARIMA
To: user@spark.apache.org
Surprised I haven't gotten any responses about this. Has anyone tried using
rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the
other way; what I'd like to do is use R for model calculation and Spark to
distribute the load across the cluster.
Also, has anyone used Scalation for time series use cases like this?
Taking the complexity of the ARIMA models out of the picture to simplify
things: I can't seem to find a good way to represent even standard moving
averages in Spark Streaming. Perhaps it's my ignorance of the micro-batched
style of the DStreams API.
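The closest I've come is a windowed mean, something like the sketch below
(the socket source and the window/batch lengths are just for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("moving-avg"), Seconds(5))

// Placeholder source: one numeric value per line.
val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

// Carry (sum, count) over a sliding 60-second window, recomputed every
// 5-second batch, then divide to get the moving average.
val meanStream = values
  .map(v => (v, 1L))
  .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), Seconds(60), Seconds(5))
  .map { case (sum, n) => sum / n }

meanStream.print()
ssc.start()
ssc.awaitTermination()

But I don't know whether that generalizes to anything stateful like ARIMA.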
On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet wrote:
> I wan