Kimahriman opened a new pull request, #6845:
URL: https://github.com/apache/hadoop/pull/6845
### Description of PR
Stores containers pending log aggregation in the NodeManager state store so
that logs can still be aggregated for completed containers after a NodeManager
restart. This undoes and replaces
https://issues.apache.org/jira/browse/YARN-4771 with a finer-grained approach
that doesn't involve storing containers indefinitely until the application
finishes.
The original approach has several issues, some of which were raised in
the JIRA but judged acceptable at the time:
- Long-running applications can lead to a large number of containers being
stored indefinitely in the state store as well as in memory on the NodeManager
- On restart, the NodeManager has to do a lot of work fully recovering all
of these completed containers just so they can be registered for log aggregation
again
- Reporting all of these containers leads to large heartbeat messages to the
ResourceManager that can DoS or OOM it
- It ignores the fact that users may not have log aggregation enabled, or
may have rolling log aggregation enabled, meaning containers stay stored even
when there are no logs left to aggregate in the future
Instead, this adds a new state store entry for containers pending log
aggregation. This resolves all of the issues above while still providing the same
guarantees about logs being aggregated after a NodeManager restart.
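
To illustrate the idea, here is a minimal sketch of what tracking only the containers pending log aggregation could look like. It is not the code in this PR: the class name `PendingLogAggregationStore`, its methods, and the `LogAggregation/pending/` key layout are all hypothetical, and a sorted in-memory map stands in for the persistent NodeManager state store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical sketch only; names and key layout are assumptions,
 * not the classes or keys added by this PR.
 */
final class PendingLogAggregationStore {
  // Assumed key prefix for entries that mark a container as pending.
  private static final String PREFIX = "LogAggregation/pending/";

  // Stand-in for the persistent key-value store. Sorted keys allow
  // range operations over a single application's entries.
  private final TreeMap<String, byte[]> db = new TreeMap<>();

  private static String key(String appId, String containerId) {
    return PREFIX + appId + "/" + containerId;
  }

  /** Record that a finished container still has logs to aggregate. */
  void storePending(String appId, String containerId) {
    db.put(key(appId, containerId), new byte[0]);
  }

  /** Drop the entry once the container's logs have been aggregated. */
  void removePending(String appId, String containerId) {
    db.remove(key(appId, containerId));
  }

  /** Drop every entry for an application, e.g. when it finishes. */
  void removeApplication(String appId) {
    String appPrefix = PREFIX + appId + "/";
    // The subMap view is backed by db, so clearing it deletes the range.
    db.subMap(appPrefix, appPrefix + Character.MAX_VALUE).clear();
  }

  /** On restart, rebuild the application -> pending containers mapping. */
  Map<String, List<String>> recoverPending() {
    Map<String, List<String>> pending = new TreeMap<>();
    for (String k : db.tailMap(PREFIX).keySet()) {
      if (!k.startsWith(PREFIX)) {
        break; // past the end of the pending-entry key range
      }
      String[] parts = k.substring(PREFIX.length()).split("/", 2);
      pending.computeIfAbsent(parts[0], a -> new ArrayList<>()).add(parts[1]);
    }
    return pending;
  }
}
```

In this sketch, an entry is written when a container completes with logs still to aggregate and is removed as soon as aggregation finishes, so the amount of recovered state is bounded by the outstanding aggregation work rather than by the lifetime of the application.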
### How was this patch tested?
New unit tests added.
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [x] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [x] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?