----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/43350/ -----------------------------------------------------------
(Updated Feb. 24, 2016, 2:05 a.m.) Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data Infrastructure). Repository: samza Description ------- * Throw a SamzaContainerStartException whenever the container start request to the NM fails. * Catch the exception in the Allocator and re-request for a container on ANY-HOST * Release the failed container * Also, refactored some methods in the ContainerUtil and ContainerAllocators to simplify the code. * Finally, fix a logic bug in the HostAwareContainerAllocator wherein the expiration logic was flipped. Diffs (updated) ----- samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 51e9e99e506c065938b18d081df6fe1ff24f79ec samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java 2e192eecf412525f45846edb62b23d261d35cbda samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 31fcc572d499d4e17cdf1340d7072c1562ccfdb7 samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 54db5e5b2b1b109d202e814809adbfd2bc84fb4b samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 91fae98c074e1648e7168fb8e76d6e1e656816fc samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java 8e1db77db74e1dc01bb0ec64e102befd225123e9 samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java bc5b606394351e4039d084914fe5898e6ff148a1 samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java PRE-CREATION samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java a3562a1155cbf24989200063b3ebd472b295db37 samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java e2b45d7577dea5a4a71af22521c93a7fd75eaefc samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java 269d82479650b3bc2890d250da0391d34104b1eb samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java 5fcad82fc10f87510e994896ed7380f1e636a18f samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java e7441e512e73b5d09740f252dc52efece640bea4 samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 4426ce6ac8808257dc00df50452df3d7555d6d0b samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 951e0f9504b87263eba891d367c9806248241c7d Diff: https://reviews.apache.org/r/43350/diff/ Testing ------- Added 2 unit tests and I'm doing some manual testing in parallel with this review. The manual test is to kill the NM for one of the containers of a job that uses Host Affinity. When the RM detects the outage, it should notify the AM which will try to restart the container on the same host. It will get a connection error and at that point should retry on a DIFFERENT host. The testing was successful for both scenarios. 1. RM is unaware that some NMs are down when job is starting: a. RM attempts to start AM on hosts, retrying up to the limit (50) in the case of connection errors to the NMs b. AM requests containers (accounting for host affinity) and tries to start the containers c. YarnClient retries connection errors (an excessive number of times) and then throws to the AM d. AM releases the bad container and requests a new one on ANY_HOST e. Container is eventually started on a good host. 2. RM is unaware that some NMs are down when job is already running: a. NM-RM heartbeat interval expires b. RM is aware of the missing NMs and containers c. RM notifies AM of the completed containers d. AM requests new containers which the RM allocates only on good hosts (since it is already aware of the bad hosts) e. Container is started on a good host. In both cases, the job stays alive and the full set of containers is eventually running on good hosts. Also, the state counters, like "neededContainers" are correct throughout the test. Thanks, Jake Maes