At Uber we have observed the same resource efficiency issue with dynamic allocation. Our workload was migrated from Hive on MR to Hive on Spark, and we saw a significant performance improvement (>2X). We also expected big resource savings from this migration, because a single Spark job replaces what would be many MR jobs. While there are cases where Spark improves efficiency compared to MR, in many cases the resource efficiency is significantly lower in Spark than in MR. This is especially surprising for Hive on Spark, because it essentially runs the same Hive code and mainly uses Spark's shuffle. Resource consumption by Spark is a big issue for efficiency-aware users.
We also analyzed this, and found that the resource consumption problem is tightly related to the way dynamic allocation works. (Static allocation could be worse in terms of efficiency.) The inefficiency comes in two aspects:

1. There is a big overhead in deallocating executors when they are no longer needed. By default, an executor dies after 60s of idle time. (One can argue that this can be tuned, but our experience showed that going lower than 60s triggers other allocation problems.) When thousands of executors are allocated and deallocated, idling out executors wastes a lot of mem-secs and core-secs, on top of the cost of allocating them in the first place. Moreover, if any stage has a straggling task, all the other executors sit idle until they time out.

2. Spark's executors tend to be bigger, usually with multiple cores, so they have bigger inertia. Even if there is only one task to run, you still need to allocate a big executor that could run multiple tasks.

Having said that, I'm not sure the proposal in SPARK-22683 is the solution (a rough sketch of how such a knob would change the executor target is appended after the quoted mail below). Nevertheless, we need to step back and reach a general agreement on whether there is a resource efficiency problem in dynamic allocation (or Spark in general). From the JIRA discussion I sensed that there might be folks who believe the problem is only a matter of tuning. With Julien's accounting and our experience, the problem is real and big.

Once we agree that there is a problem, finding the right solution is the next question. To me, the problem goes beyond a matter of tuning. (SPARK-22683 also seems to mitigate the problem by fine tuning.) We need to rethink how Spark allocates resources and executes tasks. Besides dynamic allocation, we may consider bringing in a resource allocation mode similar to the old, MR-style allocation: allocate (smaller) executors only when needed and kill them right after they finish their tasks.

Thanks,
Xuefu

On Mon, Dec 11, 2017 at 7:44 AM, Julien Cuquemelle <j.cuqueme...@criteo.com> wrote:

> Hi everyone,
>
> I'm currently porting a MapReduce application to Spark (on a YARN
> cluster), and I'd like to have your insight regarding the tuning of the
> number of executors.
>
> This application is in fact a template that users can use to launch a
> variety of jobs which range from tens to thousands of splits in the data
> partition and have a typical wall clock time of 400 to 9000 seconds;
> these jobs are experiments that are usually performed once or a few
> times, thus are not easily tunable for resource consumption.
>
> As such, as is the case with the MR jobs, I'd like users not to have to
> specify a number of executors themselves, which is why I explored the
> possibilities of dynamic allocation in Spark.
>
> The current implementation of dynamic allocation targets a total number
> of executors such that each executor core executes 1 task. This seems to
> contradict the Spark guidelines that tasks should remain small and that
> each executor core should process several tasks. It does give the best
> overall latency, but at the cost of huge resource waste, because some
> executors are not fully used, or even not used at all after having been
> allocated: in a representative set of experiments from our users,
> performed in an idle queue as well as in a busy queue, on average the
> latency in Spark is decreased by 43%, but at the cost of a 114% increase
> in vcore-hours usage w.r.t. the legacy MR job.
> Up to now I can't migrate these jobs to Spark because of the doubling of
> resource usage.
>
> I made a proposal to allow tuning the target number of tasks that each
> executor core (aka taskSlot) should process, which gives a way to tune
> the tradeoff between latency and vcore-hours consumption:
> https://issues.apache.org/jira/browse/SPARK-22683
>
> As detailed in the proposal, I've been able to reach a 37% reduction in
> latency at iso-consumption (2 tasks per taskSlot), a 30% reduction in
> resource usage at iso-latency (6 tasks per taskSlot), or a sweet spot of
> a 20% reduction in resource consumption and a 28% reduction in latency
> at 3 tasks per slot. These figures are still averages over a
> representative set of jobs our users currently launch, and are to be
> compared with the doubling of resource usage of the current Spark
> dynamic allocation behavior w.r.t. MR.
>
> As mentioned by Sean Owen in our discussion of the proposal, we currently
> have 2 options for tuning the behavior of the dynamic allocation of
> executors, maxExecutors and schedulerBacklogTimeout, but these parameters
> would need to be tuned on a per-job basis, which is not compatible with
> the one-shot nature of the jobs I'm talking about.
>
> I've still tried to tune one series of jobs with schedulerBacklogTimeout,
> and managed to get a similar vcore-hours consumption at the expense of
> 20% added latency:
>
> - the resulting value of schedulerBacklogTimeout is only valid for other
> jobs that have a similar wall clock time, so it will not be transposable
> to all the jobs our users launch;
>
> - even with a manually tuned job, I don't get the same efficiency as with
> a more global default I can set using my proposal.
>
> Thanks for any feedback regarding the pros and cons of adding a 3rd
> parameter to dynamic allocation that allows optimizing the latency /
> consumption tradeoff over a family of jobs, or any proposal that achieves
> reduced resource usage without per-job tuning under the existing dynamic
> allocation policy.
>
> Julien
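
As mentioned above, here is a rough sketch of how I understand the executor target would be affected by a tasks-per-slot knob like the one SPARK-22683 proposes. This is not the actual Spark source; it only approximates the ExecutorAllocationManager sizing logic, and the names (targetExecutors, outstandingTasks, tasksPerSlot) are illustrative:

object AllocationTargetSketch {
  // Approximate sizing rule: request enough executors so that every
  // outstanding (pending + running) task gets a slot. `tasksPerSlot` is the
  // hypothetical knob from the SPARK-22683 discussion; today's behavior
  // corresponds to tasksPerSlot = 1.
  def targetExecutors(outstandingTasks: Int,
                      executorCores: Int,
                      taskCpus: Int = 1,
                      tasksPerSlot: Int = 1): Int = {
    val slotsPerExecutor = executorCores / taskCpus
    math.ceil(outstandingTasks.toDouble / (slotsPerExecutor * tasksPerSlot)).toInt
  }

  def main(args: Array[String]): Unit = {
    // 1000 outstanding tasks on 4-core executors (1 cpu per task):
    println(targetExecutors(1000, 4))                    // 250 executors: current policy, one slot per task
    println(targetExecutors(1000, 4, tasksPerSlot = 3))  // 84 executors: each slot is expected to absorb ~3 tasks
  }
}

The point is that the existing settings (maxExecutors, schedulerBacklogTimeout, executorIdleTimeout) only cap, delay, or release that full-sized target; none of them changes how many tasks each slot is expected to absorb, which is why they end up needing per-job tuning.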
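For reference, this is what per-job tuning with those existing knobs looks like today. A minimal sketch only: the property names are real, but the values are defaults or arbitrary examples, not recommendations:

import org.apache.spark.SparkConf

object DynAllocKnobs {
  // Existing dynamic-allocation settings mentioned in this thread, with
  // illustrative values only.
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")                 // external shuffle service, required for dynamic allocation on YARN
    .set("spark.dynamicAllocation.executorIdleTimeout", "60s")    // default: idle executors are released after 60s
    .set("spark.dynamicAllocation.maxExecutors", "200")           // example per-job cap; effectively unbounded by default
    .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") // default: how long tasks may back up before more executors are requested
}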