Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

Brandon Amos Mon, 25 Aug 2014 00:28:20 -0700

\cc David Tompkins and Jim Donahue if they have anything to add.
\cc My school email. Please include bamos_cmu.edu for further discussion.


Hi Deb,


Debasish Das wrote
> Looks very cool...will try it out for ad-hoc analysis of our datasets and
> provide more feedback...
> 
> Could you please give bit more details about the differences of Spindle
> architecture compared to Hue + Spark integration (python stack) and Ooyala
> Jobserver ?
> 
> 
> Does Spindle allow sharing of spark context over multiple spark jobs like
> jobserver ?

Great point. I think job servers like these would work well with Spindle
on larger clusters.
I've added the following section to the README to mention this as an
area of future work.

Regards,
Brandon.

---

## Future Work - Utilizing Spark job servers or resource managers.
Spindle's architecture can likely be improved on larger clusters by
utilizing a job server or resource manager to
maintain a pool of Spark contexts for query execution.
[Ooyala's spark-jobserver][spark-jobserver] provides
a RESTful interface for submitting Spark jobs, which Spindle could
use instead of interfacing with Spark directly.
[YARN][yarn] can also be used to manage Spark's
resources on a cluster, as described in [this article][spark-yarn].

However, allocating resources on the cluster raises additional
questions and requires engineering work that Spindle can address in the future.
Spindle's current architecture co-locates HDFS and Spark workers
on the same nodes, minimizing the network traffic required
to load data.
How much will performance degrade if the resource manager
allocates a subset of Spark workers that are not
co-located with any of the HDFS data being accessed?
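
One rough way to frame this question is to measure what fraction of the
input blocks have a replica on a worker the resource manager actually
allocated. The following is a minimal Python sketch, not Spindle code.
The block ids, host names, and the `node_local_fraction` helper are
hypothetical and only illustrate the idea.

```python
def node_local_fraction(block_replicas, allocated_workers):
    """Fraction of blocks with at least one replica on an allocated worker.

    block_replicas: dict mapping a block id to the set of hosts holding a replica.
    allocated_workers: set of hosts assigned to the Spark context.
    """
    if not block_replicas:
        return 0.0
    # A block counts as node-local if any replica host is also a worker.
    local = sum(1 for hosts in block_replicas.values() if hosts & allocated_workers)
    return local / len(block_replicas)


# Hypothetical layout: four blocks, replication factor 2, on hosts n1..n4.
blocks = {
    "b0": {"n1", "n2"},
    "b1": {"n2", "n3"},
    "b2": {"n3", "n4"},
    "b3": {"n1", "n4"},
}

print(node_local_fraction(blocks, {"n1"}))        # half of the blocks are local
print(node_local_fraction(blocks, {"n1", "n3"}))  # every block is local
```

A low fraction suggests that most reads would cross the network, which
is the case the current co-located architecture avoids.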

Furthermore, what would a production-ready caching policy
for a pool of Spark contexts look like?
What if many queries that use the same data are submitted and
executed on different Spark contexts?
Scheduling these queries on the same Spark context and
caching the data between query executions would substantially
improve performance, but how should the scheduler
learn which queries share data?
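
As an illustration of one possible policy, the sketch below routes each
query to the context that is already assumed to cache its dataset, and
falls back to the least-loaded context otherwise. This is a hedged
Python sketch, not part of Spindle or spark-jobserver. The
`CacheAwareScheduler` class, the context ids, and the dataset names are
hypothetical, and the sketch ignores cache eviction and memory limits.

```python
from collections import defaultdict


class CacheAwareScheduler:
    """Routes each query to a context that already caches its dataset, if any."""

    def __init__(self, context_ids):
        self.context_ids = list(context_ids)
        # dataset name -> id of the context assumed to hold it in cache
        self.cached_on = {}
        # context id -> number of queries routed to it so far
        self.load = defaultdict(int)

    def assign(self, dataset):
        ctx = self.cached_on.get(dataset)
        if ctx is None:
            # No context holds this dataset yet: pick the least-loaded one
            # (ties go to the earliest context in the list).
            ctx = min(self.context_ids, key=lambda c: self.load[c])
            self.cached_on[dataset] = ctx
        self.load[ctx] += 1
        return ctx


scheduler = CacheAwareScheduler(["ctx-a", "ctx-b"])
first = scheduler.assign("hits_2014_08")    # no cache yet, least-loaded context
second = scheduler.assign("visits_2014_08")  # different data, other context
third = scheduler.assign("hits_2014_08")    # same data, same context as `first`
```

Here the scheduler learns about shared data only from the dataset name
passed with each query, which is one simple answer to the question above.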

[spark-jobserver]: https://github.com/ooyala/spark-jobserver
[yarn]: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[spark-yarn]: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203p12731.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
