----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/43350/#review120393 -----------------------------------------------------------
Fix it, then Ship it! Looks good to me. A few minor suggestions below. samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala (line 97) <https://reviews.apache.org/r/43350/#comment181833> Usually this is done in Scala using option.getOrElse. So instead of adding a new method, we can just do config.getCommandClass.getOrElse(defaultValue). samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java (line 114) <https://reviews.apache.org/r/43350/#comment181837> SamzaException instead of NullPointerException? samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java (line 27) <https://reviews.apache.org/r/43350/#comment181838> is serialVersionUID needed here? If so please use the autogenerated one. - Xinyu Liu On Feb. 23, 2016, 9:43 p.m., Jake Maes wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/43350/ > ----------------------------------------------------------- > > (Updated Feb. 23, 2016, 9:43 p.m.) > > > Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan > (Data Infrastructure). > > > Repository: samza > > > Description > ------- > > * Throw a SamzaContainerStartException whenever the container start request > to the NM fails. > * Catch the exception in the Allocator and re-request for a container on > ANY-HOST > * Release the failed container > * Also, refactored some methods in the ContainerUtil and ContainerAllocators > to simplify the code. > * Finally, fix a logic bug in the HostAwareContainerAllocator wherein the > expiration logic was flipped. > > > Diffs > ----- > > samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala > 51e9e99e506c065938b18d081df6fe1ff24f79ec > > samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java > 2e192eecf412525f45846edb62b23d261d35cbda > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java > 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 > > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java > 54db5e5b2b1b109d202e814809adbfd2bc84fb4b > samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java > 91fae98c074e1648e7168fb8e76d6e1e656816fc > > samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java > 8e1db77db74e1dc01bb0ec64e102befd225123e9 > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java > bc5b606394351e4039d084914fe5898e6ff148a1 > > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java > PRE-CREATION > samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java > a3562a1155cbf24989200063b3ebd472b295db37 > > samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java > e2b45d7577dea5a4a71af22521c93a7fd75eaefc > > samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java > 269d82479650b3bc2890d250da0391d34104b1eb > > samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java > 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 > > samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java > 5fcad82fc10f87510e994896ed7380f1e636a18f > > samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java > 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 > > samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java > e7441e512e73b5d09740f252dc52efece640bea4 > > samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java > 4426ce6ac8808257dc00df50452df3d7555d6d0b > > samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java > 951e0f9504b87263eba891d367c9806248241c7d > > Diff: https://reviews.apache.org/r/43350/diff/ > > > Testing > ------- > > Added 2 unit tests and I'm doing some manual testing in parallel with this > review. The manual test is to kill the NM for one of the containers of a job > that uses Host Affinity. When the RM detects the outage, it should notify the > AM which will try to restart the container on the same host. It will get a > connection error and at that point should retry on a DIFFERENT host. > > The testing was successful for both scenarios. > 1. RM is unaware that some NMs are down when job is starting: > a. RM attempts to start AM on hosts, retrying up to the limit (50) in the > case of connection errors to the NMs > b. AM requests containers (accounting for host affinity) and tries to > start the containers > c. YarnClient retries connection errors (an excessive number of times) and > then throws to the AM > d. AM releases the bad container and requests a new one on ANY_HOST > e. Container is eventually started on a good host. > 2. RM is unaware that some NMs are down when job is already running: > a. NM-RM heartbeat interval expires > b. RM is aware of the missing NMs and containers > c. RM notifies AM of the completed containers > d. AM requests new containers which the RM allocates only on good hosts > (since it is already aware of the bad hosts) > e. Container is started on a good host. > > In both cases, the job stays alive and the full set of containers is > eventually running on good hosts. Also, the state counters, like > "neededContainers" are correct throughout the test. > > > Thanks, > > Jake Maes > >