Hi List,

I've been following several projects with great interest over the past
few years, and I keep wondering why they aren't moving towards support
by mainstream Spark distributions, and why they aren't mentioned more
frequently when it comes to enterprise adoption of Spark.

The "middleware" components I've come across, each of which sounded like
it would largely succeed the classic Spark APIs (at least in its
respective domain), are the following [in order of appearance]:

* Spark JobServer - This was one of the first examples I came across;
it was quite exciting at the time, but I've not heard much of it since.
I assume the focus has been on stabilizing the code base.

* Oryx2 - This one was more focused on a particular problem and looked
to be a very nice framework for deploying real-time analytics -- but
again, no real traction. In fact, I've heard of PoCs being done by/for
Cloudera to demo Lambda architectures with Spark, and it was not
showcased.

* Livy - Although Livy still appears to live, I'm not really seeing the
progress that I anticipated after first hearing about it at the 2015
Spark Summit Europe. Maybe it's because the documentation isn't quite
there yet, maybe it's because features are missing -- either way, from
my last look at it, it's not quite enterprise-ready, even though it
offers a feature set that should be driving enterprise adoption.

* Mist - I just discovered it today, thinking "great, ANOTHER
middleware", which is what prompted this post. It looks quite fully
featured, but can it succeed? On the plus side, it's backed by a small,
focused business; on the down side, it's backed by a small, focused
business. Positive, since that drives development along nicely;
negative, since it inhibits adoption in the enterprise space.


Now, with that said - why did these products not gain more traction? Is
it because Spark isn't quite ready yet? Is it because of a missed
marketing opportunity?

And on another note: Should Spark integrate such a wrapper "by default"?
It would be a step beyond the SparkSQL Thrift interface, towards
offering not just programming APIs, but service APIs. Considering that
there are so many different interpretations of how this should be
solved, bundling the effort into a default implementation could be
beneficial. On the other hand, feature creep of this magnitude probably
isn't desirable.
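
To make "service API" concrete, here is a minimal sketch of what such a
layer looks like today with Livy's REST interface. It assumes a Livy
server at localhost:8998 (an illustrative URL) and uses the Python
"requests" library; the session settings and polling are simplified:

    import json
    import time

    import requests  # third-party HTTP client

    LIVY = "http://localhost:8998"  # assumed Livy endpoint
    HEADERS = {"Content-Type": "application/json"}

    # Create an interactive Spark session over HTTP, instead of
    # instantiating a SparkContext in-process.
    r = requests.post(LIVY + "/sessions",
                      data=json.dumps({"kind": "spark"}),
                      headers=HEADERS)
    session = r.json()["id"]

    # Wait for the session to come up.
    while requests.get("%s/sessions/%d" % (LIVY, session),
                       headers=HEADERS).json()["state"] != "idle":
        time.sleep(1)

    # Submit Scala code as a statement; the result comes back over HTTP.
    stmt = requests.post("%s/sessions/%d/statements" % (LIVY, session),
                         data=json.dumps(
                             {"code": "sc.parallelize(1 to 100).sum()"}),
                         headers=HEADERS).json()["id"]

    # Poll until the statement has finished, then print its output.
    while True:
        res = requests.get("%s/sessions/%d/statements/%d"
                           % (LIVY, session, stmt),
                           headers=HEADERS).json()
        if res["state"] == "available":
            print(res["output"])
            break
        time.sleep(1)

Everything an enterprise would want - authentication, authorization,
surviving a driver crash while still serving results - would have to
live behind this HTTP layer, which is exactly the niche these projects
are competing for.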

I'd like to hear some community opinions, in particular from
developers/users of these or similar projects. If I overlooked your
similar project: please pitch it -- I think this part of the ecosystem
is shaping up to be quite exciting.

Also, I'm looking at this with my enterprise glasses on: fine-grained
user authorization and authentication features are very important, as
are consistency and resiliency features. Since long-running interactive
Spark jobs are still a mixed bag stability-wise, this middleware layer
should provide the necessary buffer between crashes of the driver
program and the serving of results.
Ecosystem support is also a must - why aren't there Tableau connectors
for (some of) these APIs? [Because they're too obscure...]

A closing note: This could of course just be the usual
open-source/enterprise chicken-and-egg issue: open-source projects
without large-scale vendor support aren't interesting for the
enterprise, and enterprise features aren't interesting for the
non-enterprise developer. And worse, I wonder how many in-house custom
solutions/extensions of these projects exist in the wild, because
enterprise developers usually aren't allowed to contribute code back to
open-source projects.

Thanks for bearing with me this far,

Best

Rick
