Re: Possible deadlock in registering applications in the recovery mode

Niranda Perera Tue, 19 Apr 2016 14:30:33 -0700

Hi Reynold,

I have created a JIRA for this [1]. I have also created a PR for the same
issue [2].


I would be very grateful if you could look into this, because it is a
blocker in our Spark deployment, which uses a number of custom Spark
extensions.

thanks
best

[1] https://issues.apache.org/jira/browse/SPARK-14736
[2] https://github.com/apache/spark/pull/12506

On Mon, Apr 18, 2016 at 9:02 AM, Reynold Xin <[email protected]> wrote:

> I haven't looked closely at this, but I think your proposal makes sense.
>
>
> On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera <[email protected]>
> wrote:
>
>> Hi guys,
>>
>> Any update on this?
>>
>> Best
>>
>> On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have encountered a small issue in the standalone recovery mode.
>>>
>>> Let's say there was an application A running in the cluster. Due to some
>>> issue, the entire cluster, together with application A, goes down.
>>>
>>> Later, the cluster comes back online, and the master goes into the
>>> 'recovering' mode, because the Persistence Engine shows that some apps,
>>> workers and drivers were already in the cluster. During the recovery
>>> process, the application comes back online, but now it has a different
>>> ID, let's say B.
>>>
>>> But then, as per the master's application registration logic, this
>>> application B will NOT be added to 'waitingApps', and the master logs the
>>> message "Attempted to re-register application at same address". [1]
>>>
>>>   private def registerApplication(app: ApplicationInfo): Unit = {
>>>     val appAddress = app.driver.address
>>>     if (addressToApp.contains(appAddress)) {
>>>       logInfo("Attempted to re-register application at same address: " +
>>>         appAddress)
>>>       return
>>>     }
>>>     // ... (remainder of the method omitted)
>>>   }
>>>
>>>
>>> The problem here is that the master is trying to recover application A,
>>> which is no longer there. Therefore, after the recovery process, app A will
>>> be dropped. However, app A's successor, app B, was also omitted from the
>>> 'waitingApps' list because it had the same address that app A had
>>> previously.
>>>
>>> This creates a deadlock in the cluster: neither app A nor app B is
>>> available in the cluster.
>>>
>>> When the master is in the RECOVERING mode, shouldn't it add all the
>>> registering apps to a list first, and then after the recovery is completed
>>> (once the unsuccessful recoveries are removed), deploy the apps which are
>>> new?
>>>
>>> I think this would resolve the deadlock.
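
The deferral idea quoted above could be sketched roughly as follows. This is
a simplified, standalone model and an assumption about how such a change
might look, not the actual Master.scala code or the change in the PR. Only
'addressToApp', 'waitingApps' and 'registerApplication' come from the quoted
snippet; MasterModel, AppInfo, pendingApps, beginRecovery and
completeRecovery are hypothetical names used for illustration.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the master's state (not Spark classes).
object RecoveryState extends Enumeration {
  val Recovering, Alive = Value
}

final case class AppInfo(id: String, driverAddress: String)

class MasterModel {
  import RecoveryState._

  var state: RecoveryState.Value = Recovering
  val addressToApp = mutable.HashMap.empty[String, AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]
  // Apps that tried to register while the master was still recovering.
  private val pendingApps = mutable.ArrayBuffer.empty[AppInfo]

  /** Load apps read back from the persistence engine; some may no longer exist. */
  def beginRecovery(stored: Seq[AppInfo]): Unit = {
    state = Recovering
    stored.foreach(app => addressToApp(app.driverAddress) = app)
  }

  def registerApplication(app: AppInfo): Unit = {
    if (state == Recovering) {
      // Defer the decision instead of rejecting on a clash with a stale entry.
      pendingApps += app
    } else if (addressToApp.contains(app.driverAddress)) {
      // Same address check as in the quoted snippet: ignore the duplicate.
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  /** Drop recovered apps that never responded, then admit deferred registrations. */
  def completeRecovery(respondedIds: Set[String]): Unit = {
    val stale = addressToApp.values.filterNot(a => respondedIds.contains(a.id)).toList
    stale.foreach(a => addressToApp.remove(a.driverAddress))
    state = Alive
    val deferred = pendingApps.toList
    pendingApps.clear()
    deferred.foreach(registerApplication)
  }
}
```

In this model, if app A is loaded through beginRecovery and app B then
registers from the same address, B is held in pendingApps. Once
completeRecovery runs with A missing from respondedIds, the stale entry for A
is removed and B ends up in waitingApps.
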
>>>
>>> I look forward to hearing from you.
>>>
>>> best
>>>
>>> [1]
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>>>
>>> --
>>> Niranda
>>> @n1r44 <https://twitter.com/N1R44>
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>>
>>
>
>


-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/
