To give a bit of background to my post: I'm currently evaluating whether it would be possible to deploy a Spark-powered, data-warehouse-like structure. In particular, the client is interested in evaluating Spark's in-memory caches as a "transient persistence layer". Although in theory this could be done, I'm now looking into different avenues of how to do it properly.
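For concreteness, here is a minimal sketch of what I mean by that (the table names are hypothetical placeholders): a long-running Spark SQL context - e.g. the Spark Thrift Server - pins the hot warehouse tables in memory via Spark SQL's `CACHE TABLE`, so that downstream queries are served from RAM rather than from disk. I'm only generating the warm-up statements such a service would issue on startup:

```python
# Sketch of a "transient persistence layer": a long-running Spark SQL
# context pins hot tables in memory with CACHE TABLE, and user queries
# afterwards are served from that cache instead of from HDFS.
# The table names below are hypothetical.

HOT_TABLES = ["fact_sales", "dim_customer", "dim_product"]

def warmup_statements(tables):
    """Build the Spark SQL warm-up statements; LAZY defers
    materialization of each cache until the table is first queried."""
    return ["CACHE LAZY TABLE %s" % t for t in tables]

for stmt in warmup_statements(HOT_TABLES):
    print(stmt)  # e.g. CACHE LAZY TABLE fact_sales
```

The flip side, of course, is that such a cache lives and dies with the single driver - which is exactly why the stability of long-running interactive jobs matters so much here.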
The various job-servers initially appeared to be an option, until I actually looked at the current feature levels and progress. Much like Hive 2/LLAP, it's a case of "might work [for you] in around two years or so". So now the task is to find out why that's the case, how to actually get to the point where these features could work in two years, and whether they should work at all...

On Tue, Jan 17, 2017 at 6:38 PM, Sean Owen <so...@cloudera.com> wrote:

> On Tue, Jan 17, 2017 at 4:49 PM Rick Moritz <rah...@gmail.com> wrote:
>
>> * Oryx2 - This was more focused on a particular issue, and looked to be a very nice framework for deploying real-time analytics --- but again, no real traction. In fact, I've heard of PoCs being done by/for Cloudera, to demo lambda architectures with Spark, and this was not showcased.
>
> This one is not like the others IMHO (I'm mostly the author). It definitely doesn't serve access to Spark jobs. It's ML-focused, hence much more narrowly applicable than what 'lambda' would encompass. In practice it's used as an application, for recommendations, only. Niche, but does what it does well. It isn't used as a general platform by even Cloudera. It's framed as a reference architecture for an app.

I agree that the others are more generalized approaches, but I found Oryx2 very interesting, since it appears to provide exactly what you mention. As a reference architecture for an app, of course, it's not quite as useful. It basically boils down to the question of "official support", which our enterprise customers usually require.

>> * Livy - Although Livy still appears to live, I'm not really seeing the progress that I anticipated after first hearing about it at the 2015 Spark Summit Europe.
>> Maybe it's because the documentation isn't quite there yet, maybe it's because features are missing -- somehow, from my last look at it, it's not quite enterprise-ready yet, while offering a feature set that should be driving enterprise adoption.
>
> This and Job Server (and about 3 other job-server tools) do mostly the same thing. Livy is a Cloudera project that began to support a notebook-like tool in Hue. I think this didn't really go live because of grander plans that should emerge from the Sense acquisition. Livy's still definitely active, and I think it was created instead of adopting another tool at the time because it was deemed easier to build in the enterprise-y requirements like security from scratch. Go figure. I don't know how much Livy is really meant to be a general tool pushed for general consumption. It has existed to support CDH-related notebook tools, as I understand it, to date.

Even HDP has adopted this as a first-class citizen, as additional middleware for Zeppelin (hopefully enabling the latter to reduce its complexity in that department over time). So it definitely has the most support of the variants I mentioned - but at least from a documentation standpoint it's still very early days, and not being in the Apache Incubator might hinder adoption. Getting out from under Hue's umbrella was a strong move though, and I will surely keep watching it - I was just disappointed with the year-over-year progress, in particular given that it had strong "manufacturer support".

>> Now, with that said - why did these products not gain bigger traction? Is it because Spark isn't quite ready yet? Is it because of a missed marketing opportunity?
>
> I have a collection of guesses. First, you may be surprised how early it is for most companies to be using Spark in a basic way, let alone with 'middleware'.
I'm quite close to the "front lines" - in particular at the "enterprise level" - and I think adoption is really taking off right now. Hence my interest in abstraction layers for Spark (and maybe Hive is enough), which would enable the different end users to each do their thing, while both sharing data and hiding secrets.

> Maybe none are all that mature? None that I know of have vendor backing (Livy is not formally supported by Cloudera, even). Maybe the small market is fragmented?

That does absolutely appear to be the case -- which of course raises the question of why vendor backing is currently so hard to come by. Given the multitude of attempts at doing the same thing, I think the demand for a solution in this space is pretty evident.

> Not all of these things do quite the same thing. The space of 'apps' and 'middleware' is big.

Agreed - and that's before mentioning even more complex products, such as Ignite, which could also be pressed into the loose outline I defined. What's frustrating to me is that ever since I started looking at Spark, these solutions have existed, but none of them has reached any kind of maturity or backing yet. And although I will readily admit that miracles don't happen in software development, the fragmentation of the community probably isn't driving this space forward. On the other hand, exploring different avenues is a valid strategy in the early phases that this tech is still in. Nonetheless, I think some consolidation in the space should be embraced.

> Not all (many?) use cases require long-running Spark jobs. That is what these tools provide. It's not what Spark was built for. Using it this way has rough edges. I actually think it's this, relative lack of demand.

Also agreed - interactive long-running jobs are quite fragile and not perfectly production-ready. That puts a pretty hard cap on demand.
Yet some hardy souls have been producing solutions for years by now - so demand can't be THAT low - and the odd post on this list does ask for it.

>> Also, I'm looking at this with my enterprise glasses on: so fine-grained user authorization and authentication features are very important, as are consistency and resiliency features. Since long-running interactive Spark jobs are still a mixed bag stability-wise, this
>
> Security integration is the big issue as I understand. I don't think any of these tools can fully guarantee resiliency and consistency in the general case. A Spark job can have one driver only and there is no HA. Resource managers already manage restarting failed drivers. I don't know if that's the issue.

Yes, security is definitely the key ingredient. Additional resilience probably won't be possible for the time being, unless interactive long-runners become first-class citizens in Spark. Maybe that's the underlying discussion to have before going into middleware/service-layer discussions. Resilience against interactive users (who will try to OOM any system they can get their hands on) is a mammoth task. Consistency between different, "parallel" Spark sessions might be a topic that would have to be managed by such a middleware, in the ideal case.

>> layer of middleware should provide a necessary buffer between crashes of the driver program, and serving results.
>> Ecosystem support is also a must - why aren't there Tableau connectors for (some of) these APIs? [Because they're too obscure...]
>
> (It's much easier to plug Tableau into Impala via ODBC to do Tableau-like things on the same Parquet-formatted data you'd access in Spark.)

I'm aware of the easy solutions - and that option absolutely is on my list of propositions. Yet I want to live the dream of elegantly keeping only a single (replicated) copy of my data in memory, and avoid going to disk except to recover/extend my state.
Alas, 't was but a dream ;)
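P.S. Since Livy came up repeatedly: for anyone who wants to poke at its REST surface, here is a minimal sketch of the two calls that make up an interactive session - the documented POST /sessions and POST /sessions/{id}/statements endpoints. The host is hypothetical, and I'm only building the request payloads rather than actually hitting a server:

```python
import json

# Minimal sketch of driving Livy's REST API. Only the URLs and JSON
# payloads are built here; actually sending them (e.g. with
# urllib.request) is left to the reader. Host:port is hypothetical.
LIVY_URL = "http://livy-host:8998"

def create_session_request():
    """POST /sessions - start a long-running interactive PySpark session."""
    return LIVY_URL + "/sessions", {"kind": "pyspark"}

def submit_statement_request(session_id, code):
    """POST /sessions/{id}/statements - run a code snippet in that session."""
    url = "%s/sessions/%d/statements" % (LIVY_URL, session_id)
    return url, {"code": code}

url, body = create_session_request()
print(url, json.dumps(body))

url, body = submit_statement_request(0, "sc.parallelize(range(100)).count()")
print(url, json.dumps(body))
```

The appeal for the middleware discussion above is exactly that the session (and its cached state) outlives any single statement - which is also where all the security and resilience questions come home to roost.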