AHA! I figured it out, but it required some tedious remote debugging of the
Spark ApplicationMaster. (But now I understand the Spark codebase a little
better than before, so I guess I'm not too put out. =P)

Here's what's happening...

I am setting spark.dynamicAllocation.minExecutors=1 but am not setting
spark.dynamicAllocation.initialExecutors, so the latter remains at its
default of spark.dynamicAllocation.minExecutors. However,
ExecutorAllocationManager doesn't actually request any executors while the
application is still initializing (see comment here
<https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L292>),
but it still sets numExecutorsTarget to
spark.dynamicAllocation.initialExecutors (i.e., 1).

The JavaWordCount example I've been trying to run is only operating on a
very small file, so its first stage only has a single task and thus should
request a single executor once the polling loop comes along.

Then on this line
<https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L308>,
it computes numExecutorsTarget (1) - oldNumExecutorsTarget (still 1, even
though no executors are actually running yet) = 0 as the number of
executors to request. The app then hangs forever because it never requests
any executors.
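To make the arithmetic concrete, here's a tiny sketch of the delta logic as I
understand it. This is my own simplification, not the actual
ExecutorAllocationManager code, and the names (executorsToRequest, maxNeeded,
oldTarget) are made up for illustration:

```scala
// Simplified sketch of the target-delta arithmetic described above.
// NOT the real Spark code; it just illustrates why the request is zero.
object DeltaSketch {
  // maxNeeded:    executors needed to cover the stage's pending tasks
  // minExecutors: spark.dynamicAllocation.minExecutors
  // oldTarget:    previous numExecutorsTarget, initialized from
  //               spark.dynamicAllocation.initialExecutors, which
  //               defaults to minExecutors
  def executorsToRequest(maxNeeded: Int, minExecutors: Int, oldTarget: Int): Int = {
    val newTarget = math.max(maxNeeded, minExecutors)
    newTarget - oldTarget // only the delta from the old target is requested
  }

  def main(args: Array[String]): Unit = {
    // JavaWordCount: first stage has 1 task, minExecutors = 1, oldTarget = 1
    println(executorsToRequest(maxNeeded = 1, minExecutors = 1, oldTarget = 1)) // prints 0
  }
}
```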

I verified this further by setting spark.dynamicAllocation.minExecutors=100
and rerunning the SparkPi example I mentioned earlier (which runs 100 tasks
in its first stage because that's the number I'm passing to the driver). It
hung in the same way as my JavaWordCount example. If I run it again, passing
101 (so that it has 101 tasks), it works, and if I pass 99, it hangs again.

So it seems I have found a bug: if you set
spark.dynamicAllocation.minExecutors (or, presumably,
spark.dynamicAllocation.initialExecutors) and the number of tasks in your
first stage is less than or equal to this min/initial number of executors,
the application never requests any executors and just hangs indefinitely.
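Again using my own simplified version of the arithmetic (not Spark's actual
code; the helper name delta is made up), the three SparkPi runs with
minExecutors=100 line up with what I observed:

```scala
// Simplified sketch, not the real Spark code: the target starts at
// minExecutors (via the initialExecutors default) and only the delta
// from the old target is requested.
object SparkPiCases {
  def delta(firstStageTasks: Int, minExecutors: Int): Int = {
    val oldTarget = minExecutors // initialExecutors defaults to minExecutors
    math.max(firstStageTasks, minExecutors) - oldTarget
  }

  def main(args: Array[String]): Unit = {
    println(delta(100, 100)) // 0 -> nothing requested, hangs
    println(delta(101, 100)) // 1 -> one executor requested, works
    println(delta(99, 100))  // 0 -> nothing requested, hangs
  }
}
```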

I can't seem to find a JIRA for this, so shall I file one, or has anybody
else seen anything like this?

~ Jonathan

On Wed, Sep 23, 2015 at 7:08 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Another update that doesn't make much sense:
>
> The SparkPi example does work on yarn-cluster mode with dynamicAllocation.
>
> That is, the following command works (as well as with yarn-client mode):
>
> spark-submit --deploy-mode cluster --class
> org.apache.spark.examples.SparkPi spark-examples.jar 100
>
> But the following one does not work (nor does it work for yarn-client
> mode):
>
> spark-submit --deploy-mode cluster --class
> org.apache.spark.examples.JavaWordCount spark-examples.jar
> /tmp/word-count-input.txt
>
> So this JavaWordCount example hangs on requesting executors, while SparkPi
> and spark-shell do work.
>
> ~ Jonathan
>
> On Wed, Sep 23, 2015 at 6:22 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Thanks for the quick response!
>>
>> spark-shell is indeed using yarn-client. I forgot to mention that I also
>> have "spark.master yarn-client" in my spark-defaults.conf file too.
>>
>> The working spark-shell and my non-working example application both
>> display spark.scheduler.mode=FIFO on the Spark UI. Is that what you are
>> asking about? I haven't actually messed around with different scheduler
>> modes yet.
>>
>> One more thing I should mention is that the YARN ResourceManager tells me
>> the following on my 5-node cluster, with one node being the master and not
>> running a NodeManager:
>> Memory Used: 1.50 GB (this is the running ApplicationMaster that's
>> waiting and waiting for the executors to start up)
>> Memory Total: 45 GB (11.25 from each of the 4 slave nodes)
>> VCores Used: 1
>> VCores Total: 32
>> Active Nodes: 4
>>
>> ~ Jonathan
>>
>> On Wed, Sep 23, 2015 at 6:10 PM, Andrew Duffy <andrewedu...@gmail.com>
>> wrote:
>>
>>> What pool is the spark shell being put into? (You can see this through
>>> the YARN UI under scheduler)
>>>
>>> Are you certain you're starting spark-shell up on YARN? By default it
>>> uses a local spark executor, so if it "just works" then it's because it's
>>> not using dynamic allocation.
>>>
>>>
>>> On Wed, Sep 23, 2015 at 18:04 Jonathan Kelly <jonathaka...@gmail.com>
>>> wrote:
>>>
>>>> I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0
>>>> after using it successfully on an identically configured cluster with Spark
>>>> 1.4.1.
>>>>
>>>> I'm getting the dreaded warning "YarnClusterScheduler: Initial job has
>>>> not accepted any resources; check your cluster UI to ensure that workers
>>>> are registered and have sufficient resources", though there's nothing else
>>>> running on my cluster, and the nodes should have plenty of resources to run
>>>> my application.
>>>>
>>>> Here are the applicable properties in spark-defaults.conf:
>>>> spark.dynamicAllocation.enabled  true
>>>> spark.dynamicAllocation.minExecutors 1
>>>> spark.shuffle.service.enabled true
>>>>
>>>> When trying out my example application (just the JavaWordCount example
>>>> that comes with Spark), I had not actually set spark.executor.memory or any
>>>> CPU core-related properties, but setting the spark.executor.memory to a low
>>>> value like 64m doesn't help either.
>>>>
>>>> I've tried a 5-node cluster and 1-node cluster of m3.xlarges, so each
>>>> node has 15.0GB and 4 cores.
>>>>
>>>> I've also tried both yarn-cluster and yarn-client mode and get the same
>>>> behavior for both, except that for yarn-client mode the application never
>>>> even shows up in the YARN ResourceManager. However, spark-shell seems to
>>>> work just fine (when I run commands, it starts up executors dynamically
>>>> just fine), which makes no sense to me.
>>>>
>>>> What settings/logs should I look at to debug this, and what more
>>>> information can I provide? Your help would be very much appreciated!
>>>>
>>>> Thanks,
>>>> Jonathan
>>>>
>>>
>>
>
