Kimahriman opened a new pull request, #6845:
URL: https://github.com/apache/hadoop/pull/6845

   
   ### Description of PR
   Stores containers pending log aggregation in the NodeManager state store so 
logs can still be aggregated for completed containers after a NodeManager 
restart. This reverts and replaces 
https://issues.apache.org/jira/browse/YARN-4771 with a finer-grained approach 
that doesn't require keeping containers in the state store until the 
application finishes. 
   
   The original approach has several issues, some of which were raised in the 
JIRA but deemed acceptable at the time:
   - Long-running applications can lead to a large number of containers being 
stored indefinitely in the state store as well as in memory on the NodeManager
   - On restart, the NodeManager has to do a lot of work fully recovering all 
of these completed containers just so they can be registered for log 
aggregation again
   - Recovering them also leads to large heartbeat messages to the 
ResourceManager that can DoS it or cause it to run out of memory
   - It ignores the fact that users may not have log aggregation enabled, or 
may have rolling log aggregation enabled, so containers stay stored even when 
there is no longer any need to aggregate their logs later
   
   Instead, this adds a new state store entry for containers pending log 
aggregation. This solves all the above issues, while still providing the same 
guarantees about logs being aggregated after a Node Manager restart.
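
   As a rough illustration of the idea (not the code in this PR), the sketch 
below keeps one small entry per container that is still waiting for log 
aggregation and deletes it once aggregation is done. The class name 
`PendingLogAggregationSketch`, the key prefix `LogAggregation/pending/`, and 
all method names are hypothetical, and a sorted in-memory map stands in for 
the persistent NodeManager state store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical illustration only; not the NodeManager's actual state store API. */
class PendingLogAggregationSketch {
  // Assumed key layout: <PREFIX><applicationId>/<containerId>
  static final String PREFIX = "LogAggregation/pending/";

  // Stand-in for the persistent store; it outlives a simulated restart.
  private final SortedMap<String, byte[]> db;

  public PendingLogAggregationSketch(SortedMap<String, byte[]> db) {
    this.db = db;
  }

  /** Record that a completed container still has logs to aggregate. */
  public void storePending(String appId, String containerId) {
    db.put(PREFIX + appId + "/" + containerId, new byte[0]);
  }

  /** Drop the entry once the container's logs have been aggregated. */
  public void removePending(String appId, String containerId) {
    db.remove(PREFIX + appId + "/" + containerId);
  }

  /** On restart, rebuild the application -> pending containers mapping. */
  public Map<String, List<String>> recoverPending() {
    Map<String, List<String>> pending = new TreeMap<>();
    // Keys are sorted, so all entries under PREFIX are contiguous.
    for (String key : db.tailMap(PREFIX).keySet()) {
      if (!key.startsWith(PREFIX)) {
        break; // walked past the last pending entry
      }
      String rest = key.substring(PREFIX.length());
      int slash = rest.indexOf('/');
      String appId = rest.substring(0, slash);
      String containerId = rest.substring(slash + 1);
      pending.computeIfAbsent(appId, k -> new ArrayList<>()).add(containerId);
    }
    return pending;
  }
}
```

   In this sketch a restart is simulated by constructing a new 
`PendingLogAggregationSketch` over the same map. Only containers whose entries 
were never removed come back from `recoverPending()`, so the amount of 
recovered state tracks outstanding aggregation work rather than the total 
number of containers the application has run.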
   
   ### How was this patch tested?
   New unit tests added.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [x] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

