Hi all,

I have encountered a small issue in the standalone recovery mode.

Let's say there was an application A running in the cluster. Due to some
issue, the entire cluster, together with application A, goes down.

Later, the cluster comes back online and the master goes into the
'recovering' mode, because the persistence engine tells it that some apps,
workers and drivers were already in the cluster. During the recovery
process, the application comes back online, but now it has a different ID,
let's say B.

However, under the master's application registration logic, application B
will NOT be added to 'waitingApps', and the master logs the message
"Attempted to re-register application at same address". [1]

  private def registerApplication(app: ApplicationInfo): Unit = {
    val appAddress = app.driver.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " +
        appAddress)
      return
    }


The problem here is that the master is trying to recover application A, which
no longer exists. Therefore, after the recovery process, app A will be
dropped. However, app A's successor, app B, was also omitted from the
'waitingApps' list because it has the same address that app A had previously.

This creates a deadlock in the cluster: neither app A nor app B is available
in the cluster.
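
To make the sequence concrete, here is a minimal sketch of it. This is a
heavily simplified stand-in for the master, not the real Master.scala: the
names AppRecord, SimplifiedMaster, completeRecovery and the address
"driver-host:40000" are all made up for illustration, and only the address
check mirrors the excerpt above.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the master's bookkeeping (not Spark code).
final case class AppRecord(id: String, driverAddress: String)

class SimplifiedMaster {
  val addressToApp = mutable.HashMap.empty[String, AppRecord]
  val waitingApps = mutable.ArrayBuffer.empty[AppRecord]

  // Same guard as in the excerpt: a second app at a known address is rejected.
  def registerApplication(app: AppRecord): Boolean = {
    if (addressToApp.contains(app.driverAddress)) {
      false
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
      true
    }
  }

  // Assumed recovery outcome: apps that never responded are dropped.
  def completeRecovery(respondedIds: Set[String]): Unit = {
    val dead = waitingApps.filterNot(a => respondedIds.contains(a.id)).toList
    dead.foreach { a =>
      waitingApps -= a
      addressToApp.remove(a.driverAddress)
    }
  }
}

object RecoveryIssueDemo {
  // Returns (was B accepted?, ids left in waitingApps after recovery).
  def run(): (Boolean, List[String]) = {
    val master = new SimplifiedMaster
    // App A as restored from the persistence engine.
    master.registerApplication(AppRecord("A", "driver-host:40000"))
    // The restarted driver registers as B from the same address.
    val acceptedB = master.registerApplication(AppRecord("B", "driver-host:40000"))
    // A never answers, so recovery drops it.
    master.completeRecovery(respondedIds = Set.empty)
    (acceptedB, master.waitingApps.map(_.id).toList)
  }
}
```

In this model B is rejected while A still holds the address, and A is then
dropped, so 'waitingApps' ends up empty.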

When the master is in the RECOVERING mode, shouldn't it add all the
registering apps to a list first, and then after the recovery is completed
(once the unsuccessful recoveries are removed), deploy the apps which are
new?

I think this would resolve the deadlock.
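
A rough sketch of what I have in mind is below. It is only an illustration
under my own assumptions: AppInfo, DeferringMaster, pendingDuringRecovery,
admit and completeRecovery are hypothetical names, not existing Spark APIs,
and the real master would also have to handle workers, drivers and
persistence.

```scala
import scala.collection.mutable

// Hypothetical sketch of the proposal; none of these names exist in Spark.
final case class AppInfo(id: String, driverAddress: String)

class DeferringMaster(recovered: Seq[AppInfo]) {
  private var recovering = true
  private val addressToApp = mutable.HashMap.empty[String, AppInfo]
  // Registrations that arrive while the master is still recovering.
  private val pendingDuringRecovery = mutable.ArrayBuffer.empty[AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]

  // Apps restored from the persistence engine are admitted up front.
  recovered.foreach(admit)

  // The usual address check, applied only when an app is actually admitted.
  private def admit(app: AppInfo): Unit = {
    if (!addressToApp.contains(app.driverAddress)) {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  // During recovery, park the registration instead of rejecting it.
  def registerApplication(app: AppInfo): Unit =
    if (recovering) pendingDuringRecovery += app else admit(app)

  // Drop apps that never responded, then replay the parked registrations.
  def completeRecovery(respondedIds: Set[String]): Unit = {
    val dead = waitingApps.filterNot(a => respondedIds.contains(a.id)).toList
    dead.foreach { a =>
      waitingApps -= a
      addressToApp.remove(a.driverAddress)
    }
    recovering = false
    pendingDuringRecovery.foreach(admit)
    pendingDuringRecovery.clear()
  }
}

object DeferredRegistrationDemo {
  // Returns the ids left in waitingApps once recovery has completed.
  def run(): List[String] = {
    val master = new DeferringMaster(Seq(AppInfo("A", "driver-host:40000")))
    master.registerApplication(AppInfo("B", "driver-host:40000"))
    master.completeRecovery(respondedIds = Set.empty)
    master.waitingApps.map(_.id).toList
  }
}
```

With this ordering, the stale entry for A is cleared before B's address is
checked, so B is admitted once recovery finishes.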

Looking forward to hearing from you.

Best,

[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834

-- 
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/
