Tianyin Xu created HADOOP-11328:
-----------------------------------

             Summary: ZKFailoverController.java does not log Exception and 
causes latent problems during failover
                 Key: HADOOP-11328
                 URL: https://issues.apache.org/jira/browse/HADOOP-11328
             Project: Hadoop Common
          Issue Type: Bug
          Components: ha
    Affects Versions: 2.5.1
            Reporter: Tianyin Xu


In _ZKFailoverController.java_, the _Exception_ caught by the _run()_ method 
is never logged. This causes latent problems that only manifest during 
failover.

h5. The problem we encountered

An _Exception_ is thrown from the _doRun()_ method during _initHM()_ (caused by 
a configuration error). To reproduce it, set 
"_ha.health-monitor.connect-retry-interval.ms_" to any nonsensical value.
{code:title=ZKFailoverController.java|borderStyle=solid}
  private int doRun(String[] args) ... {
    ...
    initRPC();
    initHM();
    startRPC();
    ...
  }
{code}
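
To illustrate the kind of failure involved, here is a minimal sketch that assumes the nonsensical value fails while being parsed as a number. The _getLong_ helper, the _Map_-based configuration and the default of 1000 are hypothetical stand-ins for illustration, not Hadoop's actual classes or code path.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: mimics how an unparsable numeric setting can surface as an
// exception during initialization. Nothing here is a Hadoop API.
class RetryIntervalSketch {
    static final String KEY = "ha.health-monitor.connect-retry-interval.ms";

    // Hypothetical helper: returns the default when the key is unset,
    // otherwise parses the value strictly.
    static long getLong(Map<String, String> conf, String key, long defaultValue) {
        String raw = conf.get(key);
        if (raw == null) {
            return defaultValue;
        }
        // Long.parseLong throws NumberFormatException on non-numeric input.
        return Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, "not-a-number");
        try {
            getLong(conf, KEY, 1000L); // 1000 is a placeholder default
        } catch (NumberFormatException e) {
            System.out.println("initialization would fail here: " + e);
        }
    }
}
```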

The Exception is caught in the _run()_ method, as follows:
{code:title=ZKFailoverController.java|borderStyle=solid}
  public int run(final String[] args) throws Exception {
    ...
    try {
      ...
        @Override
        public Integer run() {
          try {
            return doRun(args);
          } catch (Exception t) {
            throw new RuntimeException(t);
          } finally {
            if (elector != null) {
              elector.terminateConnection();
            }
          }
        }
      });
    } catch (RuntimeException rte) {
      throw (Exception)rte.getCause();
    }
  }
{code}
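
The wrap-and-unwrap shape above can be reproduced in isolation. The following sketch is not the Hadoop code: it uses a plain _Supplier_ as a stand-in for the elided callback, and a hypothetical _doRun()_ that always fails. It shows that the original exception reaches the caller of _run()_ without any log line being written along the way.

```java
import java.io.IOException;
import java.util.function.Supplier;

// Sketch only: same wrap/unwrap shape as run(), with stand-in names.
class SilentRethrowSketch {
    // Hypothetical stand-in for doRun(): always fails with a checked exception.
    static int doRun() throws Exception {
        throw new IOException("bad configuration");
    }

    static int run() throws Exception {
        // The callback cannot throw checked exceptions, so the failure is
        // wrapped in a RuntimeException to get it out of the lambda.
        Supplier<Integer> action = () -> {
            try {
                return doRun();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
        try {
            return action.get();
        } catch (RuntimeException rte) {
            // Unwrapped and rethrown, but nothing is logged at this point.
            throw (Exception) rte.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            run();
        } catch (Exception e) {
            // Only a caller that chooses to report the exception will show it.
            System.out.println("caller sees: " + e);
        }
    }
}
```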

Unfortunately, the Exception (which causes the process to shut down) is *not 
logged at all*. This causes a latent error that only manifests during 
failover (because ZKFC is dead). The tricky part is that everything looks 
perfectly fine: the _jps_ command shows a running DFSZKFailoverController 
process and the two NameNodes (active and standby) work fine.

h5. Patch

We strongly suggest adding an error log that reports the caught exception, for example:

--- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java    (revision 1641307)
+++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java    (working copy)
{code:title=@@ -178,6 +178,7 @@|borderStyle=solid}
         }
       });
     } catch (RuntimeException rte) {
+      LOG.fatal("The failover controller encounters runtime error: " + rte);
       throw (Exception)rte.getCause();
     }
   }
{code}
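
For reference, a self-contained sketch of the same idea outside Hadoop. It assumes _java.util.logging_ as a stand-in for the class's own _LOG_ (so _SEVERE_ takes the place of _fatal_), and _doRun()_ is again a hypothetical method that always fails. One variation from the one-line patch: the cause is passed as a separate _Throwable_ argument, so the stack trace is recorded along with the message.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: log the failure before rethrowing it. Names are stand-ins.
class LoggedRethrowSketch {
    static final Logger LOG = Logger.getLogger("LoggedRethrowSketch");

    // Hypothetical stand-in for doRun(): always fails with a checked exception.
    static int doRun() throws Exception {
        throw new IOException("bad configuration");
    }

    static int run() throws Exception {
        try {
            try {
                return doRun();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        } catch (RuntimeException rte) {
            // Record the underlying cause, including its stack trace,
            // before handing it back to the caller.
            LOG.log(Level.SEVERE,
                    "The failover controller encountered a runtime error",
                    rte.getCause());
            throw (Exception) rte.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            run();
        } catch (Exception e) {
            System.out.println("caller sees: " + e);
        }
    }
}
```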

Thanks!




