On Fri, Nov 7, 2014 at 6:20 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> If, for example, you have a cluster of 100 machines, this means the
>> scheduler can launch 150 tasks per machine per second.
>
> Did you mean 15 tasks per machine per second here? Or alternatively, 10
> machines?

Yes -- sorry for the terrible math there!

>> I don't know of any existing Spark clusters that have a large enough
>> number of machines or short enough tasks to justify the added complexity
>> of distributing the scheduler.
>
> Actually, this was the reason I took interest in Sparrow--specifically,
> the idea of a Spark cluster handling many very short (<< 50 ms) tasks.
>
> At the recent Spark Committer Night
> <http://www.meetup.com/Spark-NYC/events/209271842/> in NYC, I asked
> Michael if he thought that Spark SQL could eventually completely fill the
> need for very low latency queries currently served by MPP databases like
> Redshift or Vertica. If I recall correctly, he said that the main
> obstacle to that was simply task startup time, which is on the order of
> 100 ms.
>
> Is there interest in (or perhaps an existing initiative related to)
> improving task startup times to the point where one could legitimately
> look at Spark SQL as a low-latency database that can serve many users or
> applications at once? That would probably make a good use case for
> Sparrow, no?

Shorter tasks would indeed be a good use case for Sparrow, and were the motivation behind the Sparrow work. When evaluating Sparrow, we focused on running SQL workloads where tasks were in the 50-100ms range (detailed in the paper <http://people.csail.mit.edu/matei/papers/2013/sosp_sparrow.pdf>). I know Evan, whom I added here, has been looking at task startup times in the context of ML workloads; this motivated some recent work (e.g., https://issues.apache.org/jira/browse/SPARK-3984) to improve the metrics shown in the UI that describe task launch overhead.
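For clarity, here is the corrected arithmetic in a tiny Python sketch. It assumes the cluster-wide figure implied by the thread (a single centralized scheduler launching roughly 1,500 tasks per second); that number is a round ballpark from this discussion, not a benchmark result:

```python
# Back-of-envelope check of the per-machine scheduling rate.
# ~1,500 tasks/sec cluster-wide is the figure implied by the thread
# (1,500 / 100 machines = 15, matching Nick's correction).
SCHEDULER_THROUGHPUT = 1500  # tasks launched per second, cluster-wide
machines = 100

per_machine_rate = SCHEDULER_THROUGHPUT / machines
print(per_machine_rate)  # 15.0 tasks per machine per second
```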
For jobs we've looked at, task startup time was at most tens of milliseconds (I also remember this being the case when we ran short tasks on Sparrow). Decreasing this seems like it would be widely beneficial, especially if there are cases where it's more like 100ms, as Michael alluded to. Hopefully some of the improved UI reporting will help clarify the degree to which this is (or is not) an issue. I'm not sure how much Evan is attempting to quantify the overhead versus fix it, so I'll let him chime in here.

> Nick
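To make the tradeoff concrete, here is a rough illustration of why startup overhead matters so much more for short tasks. The numbers are just the ballpark figures from this thread (a ~100ms startup cost against 50ms-scale tasks), not measurements:

```python
def useful_fraction(task_ms: float, startup_ms: float) -> float:
    """Fraction of a task's wall-clock time spent on actual work,
    given a fixed per-task startup overhead."""
    return task_ms / (task_ms + startup_ms)

# A 100ms startup cost dominates a 50ms task (only ~1/3 useful work)...
print(round(useful_fraction(50, 100), 2))      # 0.33
# ...but is negligible for a 10-second task.
print(round(useful_fraction(10_000, 100), 2))  # 0.99
```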