Haibo Chen created YARN-10467:
---------------------------------
Summary: ContainerIdPBImpl objects can be leaked in
RMNodeImpl.completedContainers
Key: YARN-10467
URL: https://issues.apache.org/jira/browse/YARN-10467
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.10.0
Reporter: Haibo Chen
Assignee: Haibo Chen
In a recent heap analysis, we found that most of the heap is
occupied by {{RMNodeImpl.completedContainers}}<ContainerIdPBImpl>, which
accounts for 19 GB out of 24.3 GB. There are over 86 million ContainerIdPBImpl
objects but only 161,601 RMContainerImpl objects, which represent the
number of active containers that the RM is still tracking. The ContainerIdPBImpl
objects we inspected belong to applications that finished long ago.
This indicates a memory leak of ContainerIdPBImpl objects in
RMNodeImpl.
Right now, when a container is reported by an NM as completed, it is immediately
added to RMNodeImpl.completedContainers and later cleaned up after the AM has
been notified of its completion in the AM-RM heartbeat. The cleanup can be
broken into a few steps.
* Step 1: the completed container is first added to
RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added to
{{RMNodeImpl.completedContainers}}).
* Step 2: During the AM-RM heartbeat, the container is removed from
RMAppAttemptImpl.justFinishedContainers and added to
RMAppAttemptImpl.finishedContainersSentToAM.
Once a completed container gets added to
RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned up
from {{RMNodeImpl.completedContainers}}.
However, if the AM exits (whether it failed or succeeded) before some
recently completed containers have been added to
RMAppAttemptImpl.finishedContainersSentToAM by an earlier heartbeat, there will
be no future AM-RM heartbeat to perform step 2 for them. Hence, these
objects stay in RMNodeImpl.completedContainers forever.
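
The following is a minimal sketch of the bookkeeping described above, not the actual YARN code. The classes {{Node}} and {{AppAttempt}}, their methods, and the string container IDs are hypothetical stand-ins; only the collection names mirror the real fields. It also assumes, for simplicity, that cleanup on the node happens immediately once a container reaches finishedContainersSentToAM, whereas the description above only says cleanup is guaranteed from that point on.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical model. These are NOT the real YARN classes;
// only the collection names are taken from the issue description.
class CompletedContainerLeakSketch {

    /** Stand-in for RMNodeImpl. */
    static class Node {
        final Set<String> completedContainers = new HashSet<>();

        // Called when the NM reports a container as completed.
        void containerCompleted(String containerId) {
            completedContainers.add(containerId);
        }

        // Called once the AM is known to have been told about these containers.
        void cleanUp(List<String> containerIds) {
            completedContainers.removeAll(containerIds);
        }
    }

    /** Stand-in for RMAppAttemptImpl. */
    static class AppAttempt {
        final List<String> justFinishedContainers = new ArrayList<>();
        final List<String> finishedContainersSentToAM = new ArrayList<>();

        // Step 1: record the completed container on the attempt.
        void containerFinished(String containerId) {
            justFinishedContainers.add(containerId);
        }

        // Step 2: on an AM-RM heartbeat, move the containers over.
        // Assumption of this sketch: node cleanup follows immediately.
        void amHeartbeat(Node node) {
            finishedContainersSentToAM.addAll(justFinishedContainers);
            justFinishedContainers.clear();
            node.cleanUp(finishedContainersSentToAM);
        }
    }

    /** Returns how many container IDs remain on the node after the AM is gone. */
    static int simulate(boolean amExitsBeforeHeartbeat) {
        Node node = new Node();
        AppAttempt attempt = new AppAttempt();
        for (int i = 0; i < 3; i++) {
            String id = "container_" + i;
            node.containerCompleted(id);
            attempt.containerFinished(id);
        }
        if (!amExitsBeforeHeartbeat) {
            attempt.amHeartbeat(node);
        }
        // After this point the AM has exited, so no further heartbeat arrives.
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("left on node, AM heartbeats first: " + simulate(false)); // 0
        System.out.println("left on node, AM exits first: " + simulate(true));       // 3
    }
}
```

In the second case nothing ever calls {{cleanUp}} for the three IDs, which is the shape of the leak: step 2 is the only path in this model that removes entries from the node's set.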
We have observed in MR that an AM can decide to exit once all its tasks have
succeeded, without waiting to be notified of the completion of every container,
or an AM may just die suddenly (e.g. OOM). Spark and other frameworks may
behave similarly.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)