John created FLINK-12106:
----------------------------

             Summary: JobManager is killing FINISHED TaskManager containers, 
causing exceptions in still running TaskManagers
                 Key: FLINK-12106
                 URL: https://issues.apache.org/jira/browse/FLINK-12106
             Project: Flink
          Issue Type: Bug
          Components: Deployment / YARN
    Affects Versions: 1.7.2
         Environment: Hadoop:  hdp/2.5.6.0-40

Flink: 1.7.2
            Reporter: John


When running a single Flink job on YARN, some of the TaskManager containers 
reach the FINISHED state before others.  It appears that, after receiving the 
final execution state FINISHED from a TaskManager, the JobManager waits ~68 
seconds and then frees the associated slot in that TaskManager.  After an 
additional 60 seconds, the JobManager stops the same TaskManager because the 
TaskExecutor exceeded the idle timeout.

Meanwhile, other TaskManagers are still working to complete the job.  Within 10 
seconds after the TaskManager container above is stopped, the remaining 
TaskManagers receive an exception due to loss of connection to the stopped 
TaskManager.  These exceptions result in job failure.
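For reference, both delays observed above appear to correspond to configurable 
timeouts. A minimal flink-conf.yaml sketch follows (key names and defaults are 
quoted from the Flink 1.7 configuration docs as best I recall; treat them as 
assumptions and verify against the docs for your version):

    # Time an unused slot is kept in the SlotPool before it is released
    # back to the ResourceManager (default ~50 s).
    slot.idle.timeout: 50000

    # Time an idle TaskExecutor is kept before its container is stopped
    # (default ~30 s); this is what produces the "TaskExecutor exceeded
    # the idle timeout" message seen in the logs below.
    resourcemanager.taskmanager-timeout: 30000

Raising these values would only delay the shutdown; it does not address the 
underlying problem that still-running TaskManagers lose their connection to the 
stopped container.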

 

Relevant logs:

2019-04-03 13:49:00,013 INFO  org.apache.flink.yarn.YarnResourceManager         
            - Registering TaskManager with ResourceID 
container_1553017480503_0158_01_000038 
(akka.tcp://flink@hadoop4:42745/user/taskmanager_0) at ResourceManager

2019-04-03 13:49:05,900 INFO  org.apache.flink.yarn.YarnResourceManager         
            - Registering TaskManager with ResourceID 
container_1553017480503_0158_01_000059 
(akka.tcp://flink@hadoop9:55042/user/taskmanager_0) at ResourceManager

 

 

2019-04-03 13:48:51,132 INFO  org.apache.flink.yarn.YarnResourceManager         
            - Received new container: container_1553017480503_0158_01_000077 - 
Remaining pending container requests: 6

2019-04-03 13:48:52,862 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner      
            -     
-Dlog.file=/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000077/taskmanager.log

2019-04-03 13:48:57,490 INFO  
org.apache.flink.runtime.io.network.netty.NettyServer         - Successful 
initialization (took 202 ms). Listening on SocketAddress /192.168.230.69:40140.

2019-04-03 13:49:12,575 INFO  org.apache.flink.yarn.YarnResourceManager         
            - Registering TaskManager with ResourceID 
container_1553017480503_0158_01_000077 
(akka.tcp://flink@hadoop9:51525/user/taskmanager_0) at ResourceManager

2019-04-03 13:49:12,631 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Allocated slot 
for AllocationID{42fed3e5a136240c23cc7b394e3249e9}.

2019-04-03 14:58:15,188 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Un-registering 
task and sending final execution state FINISHED to JobManager for task DataSink 
(com.anovadata.alexflinklib.sinks.bucketing.BucketingOutputFormat@26874f2c) 
a4b5fb32830d4561147b2714828109e2.

2019-04-03 14:59:23,049 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Releasing idle 
slot [AllocationID{42fed3e5a136240c23cc7b394e3249e9}].

2019-04-03 14:59:23,058 INFO  
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable      - Free slot 
TaskSlot(index:0, state:ACTIVE, resource profile: 
ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, 
directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, 
networkMemoryInMB=2147483647}, allocationId: 
AllocationID{42fed3e5a136240c23cc7b394e3249e9}, jobId: 
a6c4e367698c15cdf168d19a89faff1d).

2019-04-03 15:00:02,641 INFO  org.apache.flink.yarn.YarnResourceManager         
            - Stopping container container_1553017480503_0158_01_000077.

2019-04-03 15:00:02,646 INFO  org.apache.flink.yarn.YarnResourceManager         
            - Closing TaskExecutor connection 
container_1553017480503_0158_01_000077 because: TaskExecutor exceeded the idle 
timeout.

 

 

2019-04-03 13:48:48,902 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner      
            -     
-Dlog.file=/data1/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000059/taskmanager.log

2019-04-03 14:59:24,677 INFO  
org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
columnStore to file. allocated memory: 109479981

2019-04-03 15:00:05,696 INFO  
org.apache.parquet.hadoop.InternalParquetRecordWriter         - mem size 
135014409 > 134217728: flushing 1930100 records to disk.

2019-04-03 15:00:05,696 INFO  
org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
columnStore to file. allocated memory: 102677684

2019-04-03 15:00:08,671 ERROR org.apache.flink.runtime.operators.BatchTask      
            - Error in task code:  CHAIN Partition -> FlatMap 

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This indicates 
that the remote task manager was lost.

2019-04-03 15:00:08,714 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Un-registering 
task and sending final execution state FAILED to JobManager for task CHAIN 
Partition -> FlatMap

2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task         
            - Attempting to cancel task DataSink ()

2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task         
            - DataSink () switched from RUNNING to CANCELING.

2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task         
            - Triggering cancellation of task code DataSink ()

 

 

2019-04-03 13:48:44,562 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner      
            -     
-Dlog.file=/data8/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000038/taskmanager.log

2019-04-03 14:59:18,620 INFO  
org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
columnStore to file. allocated memory: 0

2019-04-03 14:59:48,088 INFO  
org.apache.parquet.hadoop.InternalParquetRecordWriter         - mem size 
136179972 > 134217728: flushing 1930100 records to disk.

2019-04-03 14:59:48,088 INFO  
org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
columnStore to file. allocated memory: 103333893

2019-04-03 15:00:08,692 ERROR org.apache.flink.runtime.operators.BatchTask      
            - Error in task code:  CHAIN Partition -> FlatMap

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This indicates 
that the remote task manager was lost.

2019-04-03 15:00:08,741 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor            - Un-registering 
task and sending final execution state FAILED to JobManager for task CHAIN 
Partition -> FlatMap

2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task         
            - Attempting to cancel task DataSink ()

2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task         
            - DataSink () switched from RUNNING to CANCELING.

2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task         
            - Triggering cancellation of task code DataSink ()

 

 

2019-04-03 15:00:09,196 INFO  
org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job 
a6c4e367698c15cdf168d19a89faff1d reached globally terminal state FAILED.

 

  


