Re: data localisation in spark

2015-06-03 Thread Sandy Ryza
Tasks are scheduled on executors based on data locality. Things work as you would expect in the example you brought up. Through dynamic allocation, the number of executors can change throughout the lifetime of an application. 10 executors (or 5 executors with 2 cores each) are not needed for a…
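
[Editor's note: for illustration, a minimal sketch of enabling the dynamic allocation Sandy refers to, assuming YARN mode with the external shuffle service; the property names are standard Spark configs, but the app name and the min/max executor counts are placeholders.]

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DynamicAllocationSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("dynamic-allocation-sketch")
            // Let the executor count grow and shrink with the workload.
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.dynamicAllocation.minExecutors", "2")
            .set("spark.dynamicAllocation.maxExecutors", "10")
            // Required so shuffle files outlive executors that are released.
            .set("spark.shuffle.service.enabled", "true");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... run jobs; the executor count now varies over the application's lifetime.
        jsc.stop();
      }
    }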

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
So in Spark, after acquiring executors from the ClusterManager, are tasks scheduled on executors based on data locality? I mean, if in an application there are 2 jobs and the output of job 1 is used as the input of another job, and in job 1 I called persist on some RDD, then while running job 2 will it use th…
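
[Editor's note: a minimal sketch of the two-job scenario being asked about; the input path and the filter logic are placeholders. Job 1 materialises and caches the RDD, and job 2's tasks are scheduled preferentially on the executors holding the cached partitions.]

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class PersistAcrossJobs {
      public static void main(String[] args) {
        JavaSparkContext jsc =
            new JavaSparkContext(new SparkConf().setAppName("persist-sketch"));
        JavaRDD<String> lines = jsc.textFile("hdfs:///input/data.txt");
        JavaRDD<String> cleaned = lines.filter(l -> !l.isEmpty())
                                       .persist(StorageLevel.MEMORY_ONLY());
        // Job 1: the count() action computes and caches 'cleaned'.
        long total = cleaned.count();
        // Job 2: reuses the cached partitions instead of recomputing them.
        long nonComment = cleaned.filter(l -> !l.startsWith("#")).count();
        System.out.println(total + " / " + nonComment);
        jsc.stop();
      }
    }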

Re: data localisation in spark

2015-06-02 Thread Sandy Ryza
It is not possible with JavaSparkContext either. The API mentioned below currently does not have any effect (we should document this). The primary difference between MR and Spark here is that MR runs each task in its own YARN container, while Spark runs multiple tasks within an executor, which ne…
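
[Editor's note: to make the container-model difference concrete, a sketch echoing the 10-executors-vs-5x2 example from this thread. The config names are standard YARN-mode settings; the numbers are illustrative only. Either shape yields 10 task slots, and Spark schedules tasks into those slots rather than giving each task its own YARN container as MR does.]

    import org.apache.spark.SparkConf;

    public class ExecutorShapes {
      public static void main(String[] args) {
        // 10 executors with 1 core each: 10 task slots, 10 containers.
        SparkConf tenByOne = new SparkConf()
            .set("spark.executor.instances", "10")
            .set("spark.executor.cores", "1");
        // 5 executors with 2 cores each: also 10 task slots, but 5 containers.
        SparkConf fiveByTwo = new SparkConf()
            .set("spark.executor.instances", "5")
            .set("spark.executor.cores", "2");
      }
    }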

Re: data localisation in spark

2015-06-02 Thread Shushant Arora
Is it possible with JavaSparkContext? JavaSparkContext jsc = new JavaSparkContext(conf); JavaRDD<String> lines = jsc.textFile(args[0]); If yes, is it the programmer's responsibility to first calculate split locations and then instantiate the Spark context with preferred locations? How is it achieved in MR2…
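
[Editor's note: for comparison, a sketch of how MR2 exposes split locations through its public InputFormat API. In practice the framework computes these itself and the ApplicationMaster uses them to request containers near the data; application code normally never does this by hand.]

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitLocations {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
          // getLocations() lists the hosts holding each split's blocks.
          System.out.println(String.join(",", split.getLocations()));
        }
      }
    }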

Re: data localisation in spark

2015-05-31 Thread Sandy Ryza
Hi Shushant, Spark currently makes no effort to request executors based on data locality (although it does try to schedule tasks within executors based on data locality). We're working on adding this capability at SPARK-4352. -Sandy
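
[Editor's note: the task-level locality scheduling Sandy mentions is governed by the spark.locality.wait settings; a sketch, with illustrative values only (milliseconds in Spark 1.x).]

    import org.apache.spark.SparkConf;

    public class LocalityWaitSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            // How long the scheduler waits for a free slot at each locality
            // level (process-local, node-local, rack-local) before falling
            // back to the next level.
            .set("spark.locality.wait", "3000")
            .set("spark.locality.wait.node", "3000")
            .set("spark.locality.wait.rack", "3000");
      }
    }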