-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43350/
-----------------------------------------------------------

(Updated Feb. 23, 2016, 9:43 p.m.)


Review request for samza, Navina Ramesh, Jagadish Venkatraman, and Yi Pan (Data 
Infrastructure).


Changes
-------

Updating with manual test results


Repository: samza


Description
-------

* Throw a SamzaContainerStartException whenever the container start request to 
the NM fails.
* Catch the exception in the Allocator and re-request for a container on 
ANY-HOST
* Release the failed container
* Also, refactored some methods in the ContainerUtil and ContainerAllocators to 
simplify the code.
* Finally, fix a logic bug in the HostAwareContainerAllocator wherein the 
expiration logic was flipped.


Diffs
-----

  samza-core/src/main/scala/org/apache/samza/config/TaskConfig.scala 
51e9e99e506c065938b18d081df6fe1ff24f79ec 
  
samza-yarn/src/main/java/org/apache/samza/job/yarn/AbstractContainerAllocator.java
 2e192eecf412525f45846edb62b23d261d35cbda 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerAllocator.java 
31fcc572d499d4e17cdf1340d7072c1562ccfdb7 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerRequestState.java 
54db5e5b2b1b109d202e814809adbfd2bc84fb4b 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/ContainerUtil.java 
91fae98c074e1648e7168fb8e76d6e1e656816fc 
  
samza-yarn/src/main/java/org/apache/samza/job/yarn/HostAwareContainerAllocator.java
 8e1db77db74e1dc01bb0ec64e102befd225123e9 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaAppState.java 
bc5b606394351e4039d084914fe5898e6ff148a1 
  
samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaContainerLaunchException.java
 PRE-CREATION 
  samza-yarn/src/main/java/org/apache/samza/job/yarn/SamzaTaskManager.java 
a3562a1155cbf24989200063b3ebd472b295db37 
  
samza-yarn/src/test/java/org/apache/samza/job/yarn/TestContainerAllocator.java 
e2b45d7577dea5a4a71af22521c93a7fd75eaefc 
  
samza-yarn/src/test/java/org/apache/samza/job/yarn/TestHostAwareContainerAllocator.java
 269d82479650b3bc2890d250da0391d34104b1eb 
  samza-yarn/src/test/java/org/apache/samza/job/yarn/TestSamzaTaskManager.java 
88d9f24d16fc3d9842b387cfc22edaf1dfa6fd06 
  
samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerAllocator.java
 5fcad82fc10f87510e994896ed7380f1e636a18f 
  
samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerListener.java
 8fc0b9898ed4484b9db98da7f622ab61f35cd4b1 
  
samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerRequestState.java
 e7441e512e73b5d09740f252dc52efece640bea4 
  
samza-yarn/src/test/java/org/apache/samza/job/yarn/util/MockContainerUtil.java 
4426ce6ac8808257dc00df50452df3d7555d6d0b 
  
samza-yarn/src/test/java/org/apache/samza/job/yarn/util/TestAMRMClientImpl.java 
951e0f9504b87263eba891d367c9806248241c7d 

Diff: https://reviews.apache.org/r/43350/diff/


Testing (updated)
-------

Added 2 unit tests and I'm doing some manual testing in parallel with this 
review. The manual test is to kill the NM for one of the containers of a job 
that uses Host Affinity. When the RM detects the outage, it should notify the 
AM which will try to restart the container on the same host. It will get a 
connection error and at that point should retry on a DIFFERENT host.

The testing was successful for both scenarios. 
1. RM is unaware that some NMs are down when job is starting:
   a. RM attempts to start AM on hosts, retrying up to the limit (50) in the 
case of connection errors to the NMs
   b. AM requests containers (accounting for host affinity) and tries to start 
the containers
   c. YarnClient retries connection errors (an excessive number of times) and 
then throws to the AM
   d. AM releases the bad container and requests a new one on ANY_HOST
   e. Container is eventually started on a good host. 
2. RM is unaware that some NMs are down when job is already running:
   a. NM-RM heartbeat interval expires
   b. RM is aware of the missing NMs and containers
   c. RM notifies AM of the completed containers
   d. AM requests new containers which the RM allocates only on good hosts 
(since it is already aware of the bad hosts)
   e. Container is started on a good host.
   
In both cases, the job stays alive and the full set of containers is eventually 
running on good hosts. Also, the state counters, like "neededContainers" are 
correct throughout the test.


Thanks,

Jake Maes

Reply via email to