I haven't looked closely at this, but I think your proposal makes sense.
On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera <[email protected]> wrote:

> Hi guys,
>
> Any update on this?
>
> Best
>
> On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <[email protected]> wrote:
>
>> Hi all,
>>
>> I have encountered a small issue in the standalone recovery mode.
>>
>> Let's say there was an application A running in the cluster. Due to some
>> issue, the entire cluster, together with the application A goes down.
>>
>> Then later on, cluster comes back online, and the master then goes into
>> the 'recovering' mode, because it sees some apps, workers and drivers have
>> already been in the cluster from Persistence Engine. While in the recovery
>> process, the application comes back online, but now it would have a
>> different ID, let's say B.
>>
>> But then, as per the master, application registration logic, this
>> application B will NOT be added to the 'waitingApps' with the message
>> "Attempted to re-register application at same address". [1]
>>
>>   private def registerApplication(app: ApplicationInfo): Unit = {
>>     val appAddress = app.driver.address
>>     if (addressToApp.contains(appAddress)) {
>>       logInfo("Attempted to re-register application at same address: " + appAddress)
>>       return
>>     }
>>
>> The problem here is, master is trying to recover application A, which is
>> not in there anymore. Therefore after the recovery process, app A will be
>> dropped. However app A's successor, app B was also omitted from the
>> 'waitingApps' list because it had the same address as App A previously.
>>
>> This creates a deadlock in the cluster, app A nor app B is available in
>> the cluster.
>>
>> When the master is in the RECOVERING mode, shouldn't it add all the
>> registering apps to a list first, and then after the recovery is completed
>> (once the unsuccessful recoveries are removed), deploy the apps which are
>> new?
>>
>> This would sort this deadlock IMO?
>>
>> look forward to hearing from you.
>>
>> best
>>
>> [1]
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>>
>> --
>> Niranda
>> @n1r44 <https://twitter.com/N1R44>
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>
> --
> Niranda
> @n1r44 <https://twitter.com/N1R44>
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
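
For reference, the deferral proposed in the quoted message could be sketched roughly as below. This is a self-contained toy model, not the actual Master code: AppInfo, MasterSketch, pendingRegistrations and completeRecovery are hypothetical names made up for illustration, and the real recovery logic involves workers, drivers and the persistence engine, none of which are modelled here. Only addressToApp, waitingApps and registerApplication echo names from the quoted snippet.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for the Master's application record.
final case class AppInfo(id: String, driverAddress: String)

object RecoveryState extends Enumeration {
  val Recovering, Alive = Value
}

// Toy model of the proposal: registrations that arrive while the master is
// recovering are parked, then replayed once stale recovered apps are dropped.
final class MasterSketch(recoveredApps: Seq[AppInfo]) {
  private var state: RecoveryState.Value = RecoveryState.Recovering
  // Apps known to the master, keyed by driver address.
  private val addressToApp = mutable.HashMap.empty[String, AppInfo]
  // Registrations received during recovery (assumption: a plain buffer).
  private val pendingRegistrations = mutable.ArrayBuffer.empty[AppInfo]
  val waitingApps: mutable.ArrayBuffer[AppInfo] = mutable.ArrayBuffer.empty[AppInfo]

  // Seed the address map with the apps read back from persistence.
  recoveredApps.foreach(app => addressToApp.put(app.driverAddress, app))

  def registerApplication(app: AppInfo): Unit = {
    if (state == RecoveryState.Recovering) {
      // Defer the decision instead of rejecting on an address clash.
      pendingRegistrations += app
    } else if (addressToApp.contains(app.driverAddress)) {
      println(s"Attempted to re-register application at same address: ${app.driverAddress}")
    } else {
      addressToApp.put(app.driverAddress, app)
      waitingApps += app
    }
  }

  // respondedAppIds: ids of recovered apps that answered during recovery.
  def completeRecovery(respondedAppIds: Set[String]): Unit = {
    // Drop recovered apps that never responded, freeing their addresses.
    val stale = addressToApp.values.filterNot(a => respondedAppIds.contains(a.id)).toList
    stale.foreach(a => addressToApp.remove(a.driverAddress))
    state = RecoveryState.Alive
    // Replay the deferred registrations through the normal path.
    val deferred = pendingRegistrations.toList
    pendingRegistrations.clear()
    deferred.foreach(registerApplication)
  }
}
```

In this model, if app A is recovered from persistence and app B registers from the same address during recovery, B is parked rather than rejected. When recovery completes without A responding, A's entry is removed and B lands in waitingApps. If A had responded, B would still be rejected by the existing address check.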
