Bence Kosztolnik created YARN-11656:
---------------------------------------
Summary: RMStateStore event queue blocked
Key: YARN-11656
URL: https://issues.apache.org/jira/browse/YARN-11656
Project: Hadoop YARN
Issue Type: Improvement
Components: yarn
Affects Versions: 3.4.1
Reporter: Bence Kosztolnik
Attachments: issue.png
I observed that a YARN cluster had both pending and available resources, yet
cluster utilization usually stayed around 50%. The cluster was loaded with 200
parallel PI example jobs (from hadoop-mapreduce-examples), each configured
with 20 map and 20 reduce containers, on a 50-node cluster where each node had
8 cores and a lot of memory (CPU was the bottleneck).
Eventually I found that the RM had an I/O bottleneck and needed 1 to 20
seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).
To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads
- create metrics for the RMStateStore event queue, so the problem is easy to
identify if it occurs on a cluster
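The two proposals above could be sketched roughly as follows. This is only an
illustration, not the actual YARN dispatcher or the eventual patch: the class
ParallelStoreDispatcher, its methods, and the use of plain String events are
all hypothetical. It also assumes that events for the same key (for example an
application id) must be persisted in order, so each key is pinned to one
single-threaded lane while different keys proceed in parallel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch; not the real RMStateStore dispatcher. */
final class ParallelStoreDispatcher {
    // One single-threaded executor per lane keeps per-key ordering.
    private final ExecutorService[] lanes;
    // Events queued or in flight; this is the value a queue metric would expose.
    private final AtomicInteger pending = new AtomicInteger();
    // Stand-in for the slow state-store write.
    private final Consumer<String> persister;

    ParallelStoreDispatcher(int threads, Consumer<String> persister) {
        this.lanes = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
        this.persister = persister;
    }

    /** Queues an event; events sharing a key run in order on the same lane. */
    void dispatch(String key, String event) {
        pending.incrementAndGet();
        int lane = Math.floorMod(key.hashCode(), lanes.length);
        lanes[lane].execute(() -> {
            try {
                persister.accept(event);
            } finally {
                pending.decrementAndGet();
            }
        });
    }

    /** Gauge for monitoring: a steadily growing value means the store is too slow. */
    int getPendingEvents() {
        return pending.get();
    }

    /** Stops accepting events and waits for queued ones; false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutMs) {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            for (ExecutorService lane : lanes) {
                long remaining = Math.max(deadline - System.nanoTime(), 0L);
                if (!lane.awaitTermination(remaining, TimeUnit.NANOSECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With N lanes, up to N slow writes can overlap, while getPendingEvents() gives
a single number that can be exported as a metric and alerted on.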
{panel:title=Issue visible on UI2}
[Screenshot not included in this message; the issue lists one attachment, issue.png]
{panel}