[
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tarun Parimi resolved YARN-10440.
---------------------------------
Resolution: Duplicate
This seems to be similar to YARN-8513. The default config change in YARN-8896
fixes it. Try setting:
{noformat}
yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments=100
{noformat}
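As a rough sketch, the setting above would typically be written as a property entry like the one below. This assumes it goes into capacity-scheduler.xml on the ResourceManager host; the file location and the comment text are assumptions, not something stated in this issue. Whether a queue refresh is enough or an RM restart is needed to pick it up should be verified for your version.
{code:xml}
<property>
  <!-- Upper bound on containers the CapacityScheduler assigns per node heartbeat -->
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>100</value>
</property>
{code}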
Reopen with a jstack dump of the RM if the issue recurs with the config change.
> resource manager hangs,and i cannot submit any new jobs,but rm and nm
> processes are normal
> ------------------------------------------------------------------------------------------
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.1
> Reporter: jufeng li
> Priority: Blocker
>
> The RM hangs and I cannot submit any new jobs, although the RM and NM
> processes are running normally. I can open xxxxx:8088/cluster/apps/RUNNING
> but not xxxxx:8088/cluster/scheduler. Applications that were already
> submitted never finish, and new applications cannot be submitted. Everything
> hangs except the RM and NM server processes. How can I fix this? Please help!
>
> Here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_000001 container=null queue=tianqiwang
> clusterResource=<memory:10240000, vCores:4800> type=NODE_LOCAL
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation
> proposal
> 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) -
> assignedContainer application attempt=appattempt_1600074574138_66297_000001
> container=null queue=tianqiwang clusterResource=<memory:10240000,
> vCores:4800> type=NODE_LOCAL requestedPartition=
> {code}