Greg, if you look carefully, the code enforces that the memoryOverhead be lower (not higher) than spark.driver.memory.
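
In other words, the condition trips when the memory setting is not strictly greater than its overhead. A minimal sketch of that shape (illustrative only -- the names, the 4096/384 MB figures, and the error message are mine, not the exact Spark source):

    // Simplified, self-contained sketch of the check being discussed.
    // Not the actual Spark code at ClientBase.scala#L70.
    object MemoryCheckSketch {
      def validate(driverMemoryMb: Int, driverMemoryOverheadMb: Int): Unit = {
        // Fails when memory <= overhead, i.e. the overhead is required to
        // stay *below* spark.driver.memory, not above it.
        require(driverMemoryMb > driverMemoryOverheadMb,
          s"driver memory ($driverMemoryMb MB) must be greater than " +
            s"its memoryOverhead ($driverMemoryOverheadMb MB)")
      }

      def main(args: Array[String]): Unit = {
        validate(driverMemoryMb = 4096, driverMemoryOverheadMb = 384)  // passes
        // validate(driverMemoryMb = 256, driverMemoryOverheadMb = 384) // would throw
      }
    }
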
Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill <greg.h...@rackspace.com> wrote:

> I thought I had this all figured out, but I'm getting some weird errors
> now that I'm attempting to deploy this on production-size servers. It's
> complaining that I'm not allocating enough memory to the memoryOverhead
> values. I tracked it down to this code:
>
> https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70
>
> Unless I'm reading it wrong, those checks are enforcing that you set
> spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory,
> but that makes no sense to me, since that memory is just supposed to be
> what YARN needs on top of what you're allocating for Spark. My
> understanding was that the overhead values should be quite a bit lower
> (and by default they are).
>
> Also, why must the executor be allocated less memory than the driver's
> memory overhead value?
>
> What am I misunderstanding here?
>
> Greg
>
> From: Andrew Or <and...@databricks.com>
> Date: Tuesday, September 9, 2014 5:49 PM
> To: Greg <greg.h...@rackspace.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: clarification for some spark on yarn configuration options
>
> Hi Greg,
>
> SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster.
> The equivalent "spark.executor.instances" is just another way to set the
> same thing in your spark-defaults.conf. Maybe this should be documented. :)
>
> "spark.yarn.executor.memoryOverhead" is just an additional margin added
> to "spark.executor.memory" for the container. In addition to the
> executor's memory, the container in which the executor is launched needs
> some extra memory for system processes, and this is what this "overhead"
> (somewhat of a misnomer) is for. The same goes for the driver equivalent.
>
> "spark.driver.memory" behaves differently depending on which version of
> Spark you are using. If you are using Spark 1.1+ (released very recently),
> you can set "spark.driver.memory" directly and it will take effect.
> Otherwise, setting it doesn't actually do anything in client deploy mode,
> and you have two alternatives: (1) set the equivalent environment variable
> SPARK_DRIVER_MEMORY in spark-env.sh, or (2) if you are using Spark submit
> (or bin/spark-shell, or bin/pyspark, which go through bin/spark-submit),
> pass the "--driver-memory" command-line argument.
>
> If you want your PySpark application (driver) to pick up an extra class
> path, you can pass "--driver-class-path" to Spark submit. If you are
> using Spark 1.1+, you may set "spark.driver.extraClassPath" in your
> spark-defaults.conf. There is also an environment variable you could set
> (SPARK_CLASSPATH), though it is now deprecated.
>
> Let me know if you have more questions about these options,
> -Andrew
>
>
> 2014-09-08 6:59 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:
>
>> Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster
>> or the number of workers per slave node?
>>
>> Is spark.executor.instances an actual config option? I found it in a
>> commit, but it's not in the docs.
>>
>> What is the difference between spark.yarn.executor.memoryOverhead and
>> spark.executor.memory? Same question for the 'driver' variants, but I
>> assume the answer is the same.
>>
>> Is there a spark.driver.memory option that's undocumented, or do you
>> have to use the environment variable SPARK_DRIVER_MEMORY?
>>
>> What config option or environment variable do I need to set to get
>> interactive pyspark to pick up the YARN class path? The ones that work
>> for spark-shell and spark-submit don't seem to work for pyspark.
>>
>> Thanks in advance.
>>
>> Greg
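
For anyone who finds this thread later, here is roughly how the options Andrew described above fit together in practice. The sizes and the /etc/hadoop/conf path are made-up examples, and spark.driver.memory is only picked up from this file on 1.1+:

    # spark-defaults.conf (Spark 1.1+); values below are only examples

    # total executors in the cluster (same thing as SPARK_EXECUTOR_INSTANCES)
    spark.executor.instances            4

    # executor heap, plus the extra MB YARN adds to each executor container
    spark.executor.memory               4g
    spark.yarn.executor.memoryOverhead  512

    # driver heap (honored here only on 1.1+) and its container overhead;
    # note the overhead stays well below spark.driver.memory
    spark.driver.memory                 2g
    spark.yarn.driver.memoryOverhead    512

    # extra class path for the driver on 1.1+ (the path is just an example)
    spark.driver.extraClassPath         /etc/hadoop/conf

On older versions, or as one-off overrides, the same two driver settings can be passed as flags, since bin/pyspark and bin/spark-shell go through bin/spark-submit (again, illustrative values):

    bin/pyspark --master yarn-client \
        --driver-memory 2g \
        --driver-class-path /etc/hadoop/conf
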