[jira] [Created] (FLINK-16069) Create TaskDeploymentDescriptor in future.

2020-02-14 Thread huweihua (Jira)
huweihua created FLINK-16069:


 Summary: Create TaskDeploymentDescriptor in future.
 Key: FLINK-16069
 URL: https://issues.apache.org/jira/browse/FLINK-16069
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Reporter: huweihua


Deploying tasks takes a long time when we submit a high-parallelism job. 
Execution#deploy runs in the main thread, so it blocks the JobMaster from 
processing other Akka messages, such as heartbeats. Creating the 
TaskDeploymentDescriptor takes most of the time, so we can move the creation 
into a future.

For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
heartbeats to time out, and the job never succeeded.
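
The idea can be sketched as follows. This is a minimal illustration, not Flink's Execution code: `AsyncDeploySketch`, `createDescriptor`, and the two executors are hypothetical stand-ins. The expensive descriptor creation runs on a separate executor, and only the cheap submission step is handed back to the main-thread executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; the names below are assumptions, not Flink APIs.
final class AsyncDeploySketch {

    // Stand-in for the expensive TaskDeploymentDescriptor creation.
    static String createDescriptor(int subtaskIndex) {
        return "descriptor-" + subtaskIndex;
    }

    // Builds the descriptor on ioExecutor, then runs the submission step
    // on mainThreadExecutor so that main-thread state is only touched there.
    static CompletableFuture<String> deploy(
            int subtaskIndex, Executor ioExecutor, Executor mainThreadExecutor) {
        return CompletableFuture
                .supplyAsync(() -> createDescriptor(subtaskIndex), ioExecutor)
                .thenApplyAsync(descriptor -> "submitted " + descriptor, mainThreadExecutor);
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newFixedThreadPool(4);
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            System.out.println(deploy(0, io, mainThread).join());
        } finally {
            io.shutdown();
            mainThread.shutdown();
        }
    }
}
```

With this shape the main thread stays free to process heartbeats while descriptors are being built. Whether descriptor creation can safely read execution state off the main thread is a separate question that this sketch does not address.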



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException

2020-04-23 Thread huweihua (Jira)
huweihua created FLINK-17341:


 Summary: freeSlot in TaskExecutor.closeJobManagerConnection cause 
ConcurrentModificationException
 Key: FLINK-17341
 URL: https://issues.apache.org/jira/browse/FLINK-17341
 Project: Flink
  Issue Type: Bug
Reporter: huweihua


TaskExecutor may call freeSlot during closeJobManagerConnection. freeSlot 
modifies TaskSlotTable.slotsPerJob while it is being iterated, and this 
modification causes a ConcurrentModificationException.
{code:java}
// Iterator over the active slots of this job; the error log shows it reading a HashMap.
Iterator<AllocationID> activeSlots = taskSlotTable.getActiveSlots(jobId);

final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive.");

while (activeSlots.hasNext()) {
    AllocationID activeSlot = activeSlots.next();

    try {
        if (!taskSlotTable.markSlotInactive(activeSlot, taskManagerConfiguration.getTimeout())) {
            // Frees the slot, which modifies slotsPerJob during the iteration.
            freeSlotInternal(activeSlot, freeingCause);
        }
    } catch (SlotNotFoundException e) {
        log.debug("Could not mark the slot {} inactive.", jobId, e);
    }
}
{code}
Error log:
{code:java}
2020-04-21 23:37:11,363 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor - Caught exception while executing runnable in main thread.
java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
    at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
    at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$TaskSlotIterator.hasNext(TaskSlotTable.java:698)
    at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$AllocationIDIterator.hasNext(TaskSlotTable.java:652)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1314)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:149)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1726)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
{code}
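
The failure mode can be reproduced outside Flink with a plain HashMap. The sketch below is only an illustration under assumptions: `SlotIterationSketch` and its string-keyed map are hypothetical stand-ins for the slot table, and iterating over a snapshot is one possible way to avoid the exception, not necessarily the fix that Flink should adopt.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the slot bookkeeping; not Flink's TaskSlotTable.
final class SlotIterationSketch {

    // Removes entries while iterating the live key set.
    // Returns true if a ConcurrentModificationException was thrown.
    static boolean freeWhileIterating(Map<String, String> slots) {
        try {
            Iterator<String> ids = slots.keySet().iterator();
            while (ids.hasNext()) {
                String id = ids.next();
                slots.remove(id); // structural change behind the iterator's back
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Copies the keys first, so removals no longer affect the iteration.
    // Returns the number of entries that were removed.
    static int freeFromSnapshot(Map<String, String> slots) {
        List<String> snapshot = new ArrayList<>(slots.keySet());
        int freed = 0;
        for (String id : snapshot) {
            if (slots.remove(id) != null) {
                freed++;
            }
        }
        return freed;
    }

    static Map<String, String> threeSlots() {
        Map<String, String> slots = new HashMap<>();
        slots.put("slot-a", "job-1");
        slots.put("slot-b", "job-1");
        slots.put("slot-c", "job-1");
        return slots;
    }

    public static void main(String[] args) {
        System.out.println("live iteration failed: " + freeWhileIterating(threeSlots()));
        System.out.println("freed from snapshot: " + freeFromSnapshot(threeSlots()));
    }
}
```

The snapshot costs one extra list per call, which should be negligible next to the work of freeing the slots.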





[jira] [Created] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager

2019-12-05 Thread huweihua (Jira)
huweihua created FLINK-15071:


 Summary: YARN vcore capacity check can not pass when use large 
slotPerTaskManager
 Key: FLINK-15071
 URL: https://issues.apache.org/jira/browse/FLINK-15071
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.9.0
Reporter: huweihua


The YARN vcore capacity check in YarnClusterDescriptor.isReadyForDeployment 
cannot pass if we configure a large slotsPerTaskManager (such as 96). The 
dynamic property yarn.containers.vcores does not take effect.

This is because we set dynamicProperties after checking isReadyForDeployment.
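
The ordering problem can be illustrated with a small sketch. Everything here except the key yarn.containers.vcores is an assumption made for illustration: the class and method names are hypothetical, the configuration is a plain map, the vcore count is assumed to fall back to the slot count when the key is unset, and 32 is an arbitrary per-node vcore limit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the check ordering; not YarnClusterDescriptor code.
final class VcoreCheckOrderSketch {

    static final String VCORES_KEY = "yarn.containers.vcores";

    // Assumption: without an explicit setting, vcores default to the slot count.
    static int requestedVcores(Map<String, String> config, int slotsPerTaskManager) {
        String configured = config.get(VCORES_KEY);
        return configured != null ? Integer.parseInt(configured) : slotsPerTaskManager;
    }

    static boolean isReadyForDeployment(
            Map<String, String> config, int slotsPerTaskManager, int maxVcoresPerNode) {
        return requestedVcores(config, slotsPerTaskManager) <= maxVcoresPerNode;
    }

    // Order described in the report: the check runs before dynamic properties are applied.
    static boolean checkThenApply(
            Map<String, String> base, Map<String, String> dynamic, int slots, int maxVcores) {
        Map<String, String> config = new HashMap<>(base);
        boolean ready = isReadyForDeployment(config, slots, maxVcores);
        config.putAll(dynamic); // too late to influence the check
        return ready;
    }

    // Alternative order: dynamic properties are applied first, then checked.
    static boolean applyThenCheck(
            Map<String, String> base, Map<String, String> dynamic, int slots, int maxVcores) {
        Map<String, String> config = new HashMap<>(base);
        config.putAll(dynamic);
        return isReadyForDeployment(config, slots, maxVcores);
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        Map<String, String> dynamic = new HashMap<>();
        dynamic.put(VCORES_KEY, "8");
        System.out.println("check then apply: " + checkThenApply(base, dynamic, 96, 32));
        System.out.println("apply then check: " + applyThenCheck(base, dynamic, 96, 32));
    }
}
```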

 

 





[jira] [Created] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-03-30 Thread huweihua (Jira)
huweihua created FLINK-26932:


 Summary: TaskManager hung in cleanupAllocationBaseDirs not exit.
 Key: FLINK-26932
 URL: https://issues.apache.org/jira/browse/FLINK-26932
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Reporter: huweihua


The disk used by the TaskManager had a fatal error. The TaskManager then hung 
in cleanupAllocationBaseDirs, which occupied the main thread.

So this TaskManager did not respond to cancelTask/disconnectResourceManager 
requests.

At the same time, the JobMaster already considered this TaskManager lost and 
scheduled its tasks to other TaskManagers.

This may leave some unexpected tasks running.

According to the TaskManager log, the TM had already lost its connection to 
the ResourceManager and kept trying to register with it. The 
RegistrationTimeout could not take effect because the main thread of the 
TaskManager was hung.

I think there are two options to handle it:
 # Option 1: Add a timeout for 
TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs. But I am afraid 
some other methods would block the main thread too.
 # Option 2: Move the registrationTimeout to another thread. We would need to 
deal with the concurrency problem.
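
Option 1 could look roughly like the sketch below. It is an assumption-laden illustration, not a patch: `CleanupTimeoutSketch`, `cleanupWithTimeout`, and the executor are hypothetical, and the cleanup is modeled as a plain Runnable. The caller still waits for up to the timeout, and a thread stuck in uninterruptible disk I/O may not react to the cancellation, so this bounds the wait but does not guarantee that the cleanup thread stops.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not TaskExecutorLocalStateStoreManager code.
final class CleanupTimeoutSketch {

    // Runs the cleanup on ioExecutor and waits at most timeoutMillis.
    // Returns true only if the cleanup finished normally within the timeout.
    static boolean cleanupWithTimeout(
            Runnable cleanup, ExecutorService ioExecutor, long timeoutMillis) {
        Future<?> pending = ioExecutor.submit(cleanup);
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            pending.cancel(true); // best effort: interrupts the cleanup thread
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pending.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the cleanup itself failed
        }
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            System.out.println("fast cleanup finished: " + cleanupWithTimeout(() -> { }, io, 1000));
        } finally {
            io.shutdownNow();
        }
    }
}
```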





[jira] [Created] (FLINK-26395) The description of RAND_INTEGER is wrong in SQL function documents

2022-02-28 Thread huweihua (Jira)
huweihua created FLINK-26395:


 Summary: The description of RAND_INTEGER is wrong in SQL function 
documents
 Key: FLINK-26395
 URL: https://issues.apache.org/jira/browse/FLINK-26395
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.14.3, 1.13.5, 1.15.0
Reporter: huweihua
 Attachments: image-2022-02-28-19-57-18-390.png

RAND_INTEGER returns an integer value, but the SQL function documentation says 
it returns a double value.

!image-2022-02-28-19-57-18-390.png!





[jira] [Created] (FLINK-28325) DataOutputSerializer#writeBytes increase position twice

2022-06-30 Thread huweihua (Jira)
huweihua created FLINK-28325:


 Summary: DataOutputSerializer#writeBytes increase position twice
 Key: FLINK-28325
 URL: https://issues.apache.org/jira/browse/FLINK-28325
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Reporter: huweihua
 Attachments: image-2022-06-30-18-14-50-827.png, 
image-2022-06-30-18-15-18-590.png

Hi, I was looking at the code and found that DataOutputSerializer.writeBytes 
increases the position twice. This looks like a bug to me; please let me know 
if it serves a special purpose.

org.apache.flink.core.memory.DataOutputSerializer#writeBytes

 

!image-2022-06-30-18-14-50-827.png!!image-2022-06-30-18-15-18-590.png!
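
Since the screenshots are not reproduced here, the following sketch shows the kind of double advance being described. It is an assumed shape of the problem, not the Flink source: `WriteBytesSketch` and its methods are hypothetical. Each per-byte write already moves the position, so adding the string length afterwards moves it a second time.

```java
import java.util.Arrays;

// Hypothetical buffer writer used only to illustrate the report.
final class WriteBytesSketch {

    private byte[] buffer = new byte[16];
    private int position;

    // Writes one byte and advances the position by one.
    void writeByte(int value) {
        if (position >= buffer.length) {
            buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, position + 1));
        }
        buffer[position++] = (byte) value;
    }

    // Assumed buggy variant: writeByte advances the position,
    // then the length is added once more.
    void writeBytesDoubleAdvance(String s) {
        for (int i = 0; i < s.length(); i++) {
            writeByte(s.charAt(i));
        }
        position += s.length(); // second, redundant advance
    }

    // Variant that relies on writeByte alone to advance the position.
    void writeBytes(String s) {
        for (int i = 0; i < s.length(); i++) {
            writeByte(s.charAt(i));
        }
    }

    int length() {
        return position;
    }

    public static void main(String[] args) {
        WriteBytesSketch doubled = new WriteBytesSketch();
        doubled.writeBytesDoubleAdvance("abc");
        WriteBytesSketch single = new WriteBytesSketch();
        single.writeBytes("abc");
        System.out.println(doubled.length() + " vs " + single.length());
    }
}
```

In the double-advance variant the reported length is twice the number of bytes written, so the serialized output would contain a gap of unwritten bytes after the string.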


