Hi Reynold,

I have created a JIRA for this [1]. I have also created a PR for the same issue [2].
I would be very grateful if you could look into this, because it is a blocker in our Spark deployment, which uses a number of custom Spark extensions.

Thanks

Best

[1] https://issues.apache.org/jira/browse/SPARK-14736
[2] https://github.com/apache/spark/pull/12506

On Mon, Apr 18, 2016 at 9:02 AM, Reynold Xin <[email protected]> wrote:

> I haven't looked closely at this, but I think your proposal makes sense.
>
>
> On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera <[email protected]>
> wrote:
>
>> Hi guys,
>>
>> Any update on this?
>>
>> Best
>>
>> On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have encountered a small issue in the standalone recovery mode.
>>>
>>> Let's say there was an application A running in the cluster. Due to
>>> some issue, the entire cluster, together with application A, goes
>>> down.
>>>
>>> Later on, the cluster comes back online and the master goes into the
>>> 'recovering' mode, because it sees from the Persistence Engine that
>>> some apps, workers and drivers were already in the cluster. During
>>> the recovery process, the application comes back online, but now it
>>> has a different ID, let's say B.
>>>
>>> But then, as per the master's application registration logic, this
>>> application B will NOT be added to 'waitingApps', and the master logs
>>> the message "Attempted to re-register application at same address".
>>> [1]
>>>
>>>   private def registerApplication(app: ApplicationInfo): Unit = {
>>>     val appAddress = app.driver.address
>>>     if (addressToApp.contains(appAddress)) {
>>>       logInfo("Attempted to re-register application at same address: " + appAddress)
>>>       return
>>>     }
>>>
>>>
>>> The problem here is that the master is trying to recover application
>>> A, which is not there anymore. Therefore, after the recovery process,
>>> app A will be dropped. However, app A's successor, app B, was also
>>> omitted from the 'waitingApps' list, because it had the same address
>>> as app A had previously.
>>>
>>> This creates a deadlock in the cluster: neither app A nor app B is
>>> available in the cluster.
>>>
>>> When the master is in the RECOVERING mode, shouldn't it add all the
>>> registering apps to a list first, and then, after the recovery is
>>> completed (once the unsuccessful recoveries are removed), deploy the
>>> apps which are new?
>>>
>>> This would sort out this deadlock, IMO?
>>>
>>> Looking forward to hearing from you.
>>>
>>> Best
>>>
>>> [1]
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>>>
>>> --
>>> Niranda
>>> @n1r44 <https://twitter.com/N1R44>
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>>
>> --
>> Niranda
>> @n1r44 <https://twitter.com/N1R44>
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>

--
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/
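
For reference, the deferred-registration idea from the original message can be sketched roughly as below. This is only an illustration under simplifying assumptions, not the change in the PR and not Spark's actual Master code: `MasterSketch`, `AppInfo`, `MasterState`, `pendingApps`, `restoreFromPersistence` and `completeRecovery` are hypothetical names, and only `addressToApp`, `waitingApps` and `registerApplication` mirror identifiers from Master.scala.

```scala
import scala.collection.mutable

// Hypothetical, heavily simplified stand-ins for the master's internals.
object MasterState extends Enumeration {
  val Recovering, Alive = Value
}

final case class AppInfo(id: String, driverAddress: String)

class MasterSketch {
  var state: MasterState.Value = MasterState.Recovering
  val addressToApp = mutable.HashMap.empty[String, AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]
  // Registrations that arrive while the master is still recovering.
  private val pendingApps = mutable.ArrayBuffer.empty[AppInfo]

  // An app read back from the persistence engine keeps its address
  // reserved until recovery decides whether it is still alive.
  def restoreFromPersistence(app: AppInfo): Unit = {
    addressToApp(app.driverAddress) = app
  }

  def registerApplication(app: AppInfo): Unit = {
    if (state == MasterState.Recovering) {
      // Park the request instead of rejecting it on an address clash.
      pendingApps += app
    } else if (addressToApp.contains(app.driverAddress)) {
      println(s"Attempted to re-register application at same address: ${app.driverAddress}")
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  // respondedAppIds: IDs of restored apps that answered during recovery.
  def completeRecovery(respondedAppIds: Set[String]): Unit = {
    // Drop restored apps that never responded, freeing their addresses.
    val stale = addressToApp.values.filterNot(a => respondedAppIds.contains(a.id)).toList
    stale.foreach(a => addressToApp.remove(a.driverAddress))
    state = MasterState.Alive
    // Replay the parked registrations through the normal path.
    val parked = pendingApps.toList
    pendingApps.clear()
    parked.foreach(registerApplication)
  }
}
```

With this shape, app B registering from app A's old address during recovery is held back rather than refused. Once app A is dropped as unrecoverable, B is replayed and ends up in `waitingApps`. If app A had responded, B would still be rejected by the existing same-address check.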
