[jira] [Created] (FLINK-16069) Create TaskDeploymentDescriptor in future.
huweihua created FLINK-16069: Summary: Create TaskDeploymentDescriptor in future. Key: FLINK-16069 URL: https://issues.apache.org/jira/browse/FLINK-16069 Project: Flink Issue Type: Improvement Components: Runtime / Task Reporter: huweihua The deploy of tasks will took long time when we submit a high parallelism job. And Execution#deploy run in mainThread, so it will block JobMaster process other akka messages, such as Heartbeat. The creation of TaskDeploymentDescriptor take most of time. We can put the creation in future. For example, A job [source(8000)->sink(8000)], the total 16000 tasks from SCHEDULED to DEPLOYING took more than 1mins. This caused the heartbeat of TaskManager timeout and job never success. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException
huweihua created FLINK-17341: Summary: freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException Key: FLINK-17341 URL: https://issues.apache.org/jira/browse/FLINK-17341 Project: Flink Issue Type: Bug Reporter: huweihua TaskExecutor may freeSlot when closeJobManagerConnection. freeSlot will modify the TaskSlotTable.slotsPerJob. this modify will cause ConcurrentModificationException. {code:java} Iterator activeSlots = taskSlotTable.getActiveSlots(jobId); final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive."); while (activeSlots.hasNext()) { AllocationID activeSlot = activeSlots.next(); try { if (!taskSlotTable.markSlotInactive(activeSlot, taskManagerConfiguration.getTimeout())) { freeSlotInternal(activeSlot, freeingCause); } } catch (SlotNotFoundException e) { log.debug("Could not mark the slot {} inactive.", jobId, e); } } {code} error log: {code:java} 2020-04-21 23:37:11,363 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor - Caught exception while executing runnable in main thread. java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437) at java.util.HashMap$KeyIterator.next(HashMap.java:1461) at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$TaskSlotIterator.hasNext(TaskSlotTable.java:698) at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$AllocationIDIterator.hasNext(TaskSlotTable.java:652) at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1314) at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:149) at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1726) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager
huweihua created FLINK-15071: Summary: YARN vcore capacity check can not pass when use large slotPerTaskManager Key: FLINK-15071 URL: https://issues.apache.org/jira/browse/FLINK-15071 Project: Flink Issue Type: Bug Components: Deployment / YARN Affects Versions: 1.9.0 Reporter: huweihua The check of YARN vcore capacity in YarnClusterDescriptor.isReadyForDeployment can not pass If we config large slotsPerTaskManager(such as 96). The dynamic property yarn.containers.vcores does not work. This is because we set dynamicProperties after check isReadyForDeployment. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.
huweihua created FLINK-26932: Summary: TaskManager hung in cleanupAllocationBaseDirs not exit. Key: FLINK-26932 URL: https://issues.apache.org/jira/browse/FLINK-26932 Project: Flink Issue Type: Bug Components: Runtime / Coordination Reporter: huweihua The disk TaskManager used had some fatal error. And then TaskManager hung in cleanupAllocationBaseDirs and took the main thread. So this TaskManager would not respond to the cancelTask/disconnectResourceManager request. At the same time, JobMaster already take this TaskManager is lost, and schedule task to other TaskManager. This may cause some unexpected task running. After checking the log of TaskManager, TM already lost the connection with ResourceManager, and it is always trying to register with ResourceManager. The RegistrationTimeout cannot take effect because the main thread of TaskManager is hung-up. I think there are two options to handle it. # Option 1: Add timeout for TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, But I am afraid some other methods would block main thread too. # Option 2: Move the registrationTimeout in another thread, we need to deal will the concurrency problem !https://bytedance.feishu.cn/space/api/box/stream/download/asynccode/?code=ZmVkMDNhZjZkNzA2NTNkOGZjNjJmNGM0ZGYxNGY2NDFfTnV4SUd0RzQ3WnVJRWpWdVBJNFFncEMzTHdZZ3U0WDFfVG9rZW46Ym94Y25zMG1GdWM5M2hKNzJEcXhyN0FmRFgxXzE2NDg2NTE4Njg6MTY0ODY1NTQ2OF9WNA! !https://bytedance.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhiZDU0NDg0NzU3ZjgwYmIxOTU0YzQyMTIxMGE4YzJfQkhLMVI2bEZGUnhpR210c1BDelZDRUI3YjJDY2Q1T3NfVG9rZW46Ym94Y250aG1UTjBXTmI2TTFqYlV1eG9MTnMwXzE2NDg2NTE4NzU6MTY0ODY1NTQ3NV9WNA! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26395) The description of RAND_INTEGER is wrong in SQL function documents
huweihua created FLINK-26395: Summary: The description of RAND_INTEGER is wrong in SQL function documents Key: FLINK-26395 URL: https://issues.apache.org/jira/browse/FLINK-26395 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.14.3, 1.13.5, 1.15.0 Reporter: huweihua Attachments: image-2022-02-28-19-57-18-390.png RAND_INTEGER will returns a integer value, but document of SQL function shows it will return a double value. !image-2022-02-28-19-57-18-390.png! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-28325) DataOutputSerializer#writeBytes increase position twice
huweihua created FLINK-28325: Summary: DataOutputSerializer#writeBytes increase position twice Key: FLINK-28325 URL: https://issues.apache.org/jira/browse/FLINK-28325 Project: Flink Issue Type: Bug Components: Runtime / Task Reporter: huweihua Attachments: image-2022-06-30-18-14-50-827.png, image-2022-06-30-18-15-18-590.png Hi, I was looking at the code and found that DataOutputSerializer.writeBytes increases the position twice, I feel it is a problem, please let me know if it is for a special purpose org.apache.flink.core.memory.DataOutputSerializer#writeBytes !image-2022-06-30-18-14-50-827.png!!image-2022-06-30-18-15-18-590.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)