Thanks Andrew,

Is there a chance that even with full-caching, that modes other than
PROCESS_LOCAL will be used? E.g., let's say, an executor will try to
perform tasks although the data are cached on a different executor.

What I'd like to do is to prevent such a scenario entirely.

I'd like to know if setting 'spark.locality.wait' to a very high value
would guarantee that the mode will always be 'PROCESS_LOCAL'.


On Thu, Jun 5, 2014 at 3:36 PM, Andrew Ash <and...@andrewash.com> wrote:

> The locality is how close the data is to the code that's processing it.
>  PROCESS_LOCAL means data is in the same JVM as the code that's running, so
> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
> same node, or in another executor on the same node, so is a little slower
> because the data has to travel across an IPC connection.  RACK_LOCAL is
> even slower -- data is on a different server so needs to be sent over the
> network.
>
> Spark switches to lower locality levels when there's no unprocessed data
> on a node that has idle CPUs.  In that situation you have two options: wait
> until the busy CPUs free up so you can start another task that uses data on
> that server, or start a new task on a farther away server that needs to
> bring data from that remote place.  What Spark typically does is wait a bit
> in the hopes that a busy CPU frees up.  Once that timeout expires, it
> starts moving the data from far away to the free CPU.
>
> The main tunable option is how far long the scheduler waits before
> starting to move data rather than code.  Those are the spark.locality.*
> settings here: http://spark.apache.org/docs/latest/configuration.html
>
> If you want to prevent this from happening entirely, you can set the
> values to ridiculously high numbers.  The documentation also mentions that
> "0" has special meaning, so you can try that as well.
>
> Good luck!
> Andrew
>
>
> On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung <coded...@cs.stanford.edu>
> wrote:
>
>> I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
>> assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
>>
>> When these happen things get extremely slow.
>>
>> Does this mean that the executor got terminated and restarted?
>>
>> Is there a way to prevent this from happening (barring the machine
>> actually going down, I'd rather stick with the same process)?
>>
>
>

Reply via email to