Thanks Fred for the detailed reply. The stability points are especially interesting as a goal for the streaming component in Spark. In terms of next steps, one approach that might be helpful is to create benchmarks that mimic real-life workloads; from there we can isolate the specific changes that are required. I've put a rough sketch of what I mean below. It'd also be great to hear other approaches / next steps to concretize some of these goals.
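As a strawman, here is an untested sketch of the kind of harness I have in mind, using the DStream queueStream API so the benchmark fully controls the input rate. Everything in it (the 1-second batch interval, the baseline rate, the 10x burst every tenth batch, and the placeholder map/count workload) is made up for illustration; a real benchmark would replay traces recorded from actual workloads:

  // Sketch of a bursty-workload micro-benchmark (hypothetical harness,
  // not an agreed benchmark design): feed Spark Streaming batches whose
  // size alternates between a baseline and a burst, and log per-batch
  // delays so we can see when processing falls behind arrival.
  import scala.collection.mutable
  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  object BurstyBenchmark {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("bursty-benchmark").setMaster("local[4]")
      val ssc = new StreamingContext(conf, Seconds(1))

      // Queue of pre-built RDDs; queueStream consumes one per batch interval.
      val queue = new mutable.Queue[RDD[Int]]()
      val stream = ssc.queueStream(queue)

      // Placeholder workload; swap in something heavier as needed.
      stream.map(_ * 2).count().print()

      // Record scheduling delay and processing time for each completed batch.
      ssc.addStreamingListener(new StreamingListener {
        override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
          val info = batch.batchInfo
          println(s"batch=${info.batchTime} " +
            s"schedDelay=${info.schedulingDelay.getOrElse(-1L)}ms " +
            s"procTime=${info.processingDelay.getOrElse(-1L)}ms")
        }
      })

      ssc.start()
      // Baseline of 100K records/batch with a 10x burst every tenth batch.
      for (i <- 1 to 100) {
        val n = if (i % 10 == 0) 1000000 else 100000
        queue.synchronized {
          queue += ssc.sparkContext.parallelize(1 to n)
        }
        Thread.sleep(1000)
      }
      ssc.stop()
    }
  }

The scheduling-delay numbers from the listener would give us a direct view of when, and how gracefully, processing falls behind arrival, which seems like the right axis to measure for the "don't bring down my cluster" goal below.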
Thanks
Shivaram

On Thu, Oct 13, 2016 at 8:39 AM, Fred Reiss <freiss....@gmail.com> wrote:
> On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>>
>> Could you expand a little bit more on stability? Is it just bursty
>> workloads in terms of peak vs. average throughput? Also, what level of
>> latencies do you find users care about? Is it on the order of 2-3
>> seconds vs. 1 second vs. 100s of milliseconds?
>>
>
> Regarding stability, I've seen two levels of concrete requirements.
>
> The first is "don't bring down my Spark cluster". That is to say,
> regardless of the input data rate, Spark shouldn't thrash or crash
> outright. Processing may lag behind the data arrival rate, but the
> cluster should stay up and remain fully functional.
>
> The second level is "don't bring down my application". A common use for
> streaming systems is to handle heavyweight computations that are part
> of a larger application, like a web application, a mobile app, or a
> plant control system. For example, an online application for car
> insurance might need to do some pretty involved machine learning to
> produce an accurate quote and suggest good upsells to the customer. If
> the heavyweight portion times out, the whole application times out, and
> you lose a customer.
>
> In terms of bursty vs. non-bursty, the "don't bring down my Spark
> cluster" case is more about handling bursts, while the "don't bring
> down my application" case is more about delivering acceptable
> end-to-end response times under typical load.
>
> Regarding latency: One group I talked to mentioned requirements in the
> 100-200 msec range, driven by the need to display a web page on a
> browser or mobile device. Another group in the Internet of Things space
> mentioned times ranging from 5 seconds to 30 seconds throughout the
> conversation. But most people I've talked to have been pretty vague
> about specific numbers.
>
> My impression is that these groups are not motivated by anxiety about
> meeting a particular latency target for a particular application.
> Rather, they want to make low latency the norm so that they can stop
> having to think about latency. Today, low latency is a special
> requirement of special applications. But that policy imposes a lot of
> hidden costs. IT architects have to spend time estimating the latency
> requirements of every application and lobbying for special treatment
> when those requirements are strict. Managers have to spend time
> engineering business processes around latency. Data scientists have to
> spend time packaging up models and negotiating how those models will be
> shipped over to the low-latency serving tier. And customers who are
> accustomed to Google and smartphones end up with an experience that is
> functional but unsatisfying.
>
> It's best to think of latency as a sliding scale. A given level of
> latency imposes a given level of cost enterprise-wide. Someone who is
> making a decision on middleware policy will balance this cost against
> other costs: How much does it cost to deploy the middleware? How much
> does it cost to train developers to use the system? The winner will be
> the system that minimizes the overall cost.
>
> Fred