Hi Greg,

From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not actually
picked up in cluster mode. This is a bug and I have opened a PR to fix it:
https://github.com/apache/spark/pull/2500. For now, please use --driver-memory
instead, which should work for both client and cluster mode.
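For example, for cluster mode something along these lines should work (the
class name and jar path below are just placeholders for your own application):

    ./bin/spark-submit \
      --master yarn-cluster \
      --driver-memory 4g \
      --class com.example.YourApp \
      /path/to/your-app.jar

The same --driver-memory flag applies in client mode, and to bin/spark-shell
and bin/pyspark, since those go through spark-submit as well.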
Thanks for pointing this out,
-Andrew

2014-09-22 14:04 GMT-07:00 Nishkam Ravi <nr...@cloudera.com>:

> Maybe try --driver-memory if you are using spark-submit?
>
> Thanks,
> Nishkam
>
> On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill <greg.h...@rackspace.com> wrote:
>
>> Ah, I see. It turns out that my problem is that that comparison is
>> ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m. Is
>> that a bug that's since fixed? I'm on 1.0.1 and using 'yarn-cluster' as
>> the master. 'yarn-client' seems to pick up the values and works fine.
>>
>> Greg
>>
>> From: Nishkam Ravi <nr...@cloudera.com>
>> Date: Monday, September 22, 2014 3:30 PM
>> To: Greg <greg.h...@rackspace.com>
>> Cc: Andrew Or <and...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: clarification for some spark on yarn configuration options
>>
>> Greg, if you look carefully, the code is enforcing that the
>> memoryOverhead be lower (and not higher) than spark.driver.memory.
>>
>> Thanks,
>> Nishkam
>>
>> On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill <greg.h...@rackspace.com> wrote:
>>
>>> I thought I had this all figured out, but I'm getting some weird errors
>>> now that I'm attempting to deploy this on production-size servers. It's
>>> complaining that I'm not allocating enough memory to the memoryOverhead
>>> values. I tracked it down to this code:
>>>
>>> https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70
>>>
>>> Unless I'm reading it wrong, those checks are enforcing that you set
>>> spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory,
>>> but that makes no sense to me since that memory is just supposed to be
>>> what YARN needs on top of what you're allocating for Spark. My
>>> understanding was that the overhead values should be quite a bit lower
>>> (and by default they are).
>>>
>>> Also, why must the executor be allocated less memory than the driver's
>>> memory overhead value?
>>>
>>> What am I misunderstanding here?
>>>
>>> Greg
>>>
>>> From: Andrew Or <and...@databricks.com>
>>> Date: Tuesday, September 9, 2014 5:49 PM
>>> To: Greg <greg.h...@rackspace.com>
>>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>>> Subject: Re: clarification for some spark on yarn configuration options
>>>
>>> Hi Greg,
>>>
>>> SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster.
>>> The equivalent "spark.executor.instances" is just another way to set the
>>> same thing in your spark-defaults.conf. Maybe this should be documented. :)
>>>
>>> "spark.yarn.executor.memoryOverhead" is just an additional margin added
>>> to "spark.executor.memory" for the container. In addition to the
>>> executor's memory, the container in which the executor is launched needs
>>> some extra memory for system processes, and this is what this "overhead"
>>> (somewhat of a misnomer) is for. The same goes for the driver equivalent.
>>>
>>> "spark.driver.memory" behaves differently depending on which version of
>>> Spark you are using. If you are using Spark 1.1+ (this was released very
>>> recently), you can directly set "spark.driver.memory" and this will take
>>> effect.
>>> Otherwise, setting this doesn't actually do anything for client deploy
>>> mode, and you have two alternatives: (1) set the environment variable
>>> equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using
>>> Spark submit (or bin/spark-shell, or bin/pyspark, which go through
>>> bin/spark-submit), pass the "--driver-memory" command line argument.
>>>
>>> If you want your PySpark application (driver) to pick up extra class
>>> path, you can pass "--driver-class-path" to Spark submit. If you are
>>> using Spark 1.1+, you may set "spark.driver.extraClassPath" in your
>>> spark-defaults.conf. There is also an environment variable you could set
>>> (SPARK_CLASSPATH), though this is now deprecated.
>>>
>>> Let me know if you have more questions about these options,
>>> -Andrew
>>>
>>> 2014-09-08 6:59 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:
>>>
>>>> Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster
>>>> or the workers per slave node?
>>>>
>>>> Is spark.executor.instances an actual config option? I found that in a
>>>> commit, but it's not in the docs.
>>>>
>>>> What is the difference between spark.yarn.executor.memoryOverhead and
>>>> spark.executor.memory? Same question for the 'driver' variant, but I
>>>> assume it's the same answer.
>>>>
>>>> Is there a spark.driver.memory option that's undocumented or do you
>>>> have to use the environment variable SPARK_DRIVER_MEMORY?
>>>>
>>>> What config option or environment variable do I need to set to get
>>>> pyspark interactive to pick up the yarn class path? The ones that work
>>>> for spark-shell and spark-submit don't seem to work for pyspark.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Greg
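To make the settings discussed in the quoted thread a bit more concrete, here
is roughly what they might look like in spark-defaults.conf on Spark 1.1+ (the
property names are the ones discussed above; the values and the class path
entry are only illustrative placeholders, not recommendations):

    # total number of executors requested from YARN
    spark.executor.instances             4
    # heap size for each executor JVM
    spark.executor.memory                4g
    # extra memory (in MB) YARN adds to each executor container, on top of the heap
    spark.yarn.executor.memoryOverhead   512
    # driver heap; takes effect directly on 1.1+
    spark.driver.memory                  2g
    # extra memory (in MB) for the driver's container in cluster mode
    spark.yarn.driver.memoryOverhead     512
    # extra class path entries for the driver (1.1+)
    spark.driver.extraClassPath          /path/to/extra.jar

The overhead values sit on top of the corresponding memory settings and are
normally much smaller than them.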