Hello, we have a scenario where we run Data Processing jobs that generate export files on demand. Our first approach used the ClusterClient, but we recently switched to the REST API for job submission. In the meantime we also upgraded to Flink 1.7.1, and that is when the problems started. Some of our jobs get stuck and do not process any data: the TaskManagers log that the chain is switching to RUNNING, and then nothing happens.
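For context, the submission now goes through the JobManager's REST endpoint. A minimal sketch of what we do is below; the host, jar id, entry class and parallelism are placeholders, and the jar is assumed to have already been uploaded via /jars/upload:

import java.net.HttpURLConnection;
import java.net.URL;

public class SubmitExportJob {
    public static void main(String[] args) throws Exception {
        // Placeholders only -- the jar id is the one returned by /jars/upload.
        URL url = new URL("http://jobmanager-host:8081/jars/"
                + "abc123_export-job.jar/run"
                + "?entry-class=com.example.ExportJob&parallelism=4");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");       // /jars/:jarid/run is a POST request
        int status = conn.getResponseCode(); // triggers the request
        System.out.println("Submission returned HTTP " + status);
        conn.disconnect();
    }
}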
In the TMs' stdout logs we can see that the log is cut off for some reason, e.g.:

Jan 10, 2019 4:28:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 615 records.
Jan 10, 2019 4:28:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jan 10, 2019 4:28:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 63 ms. row count = 615
Jan 10, 2019 4:28:33 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jan 10, 2019 4:28:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 140 records.
Jan 10, 2019 4:28:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jan 10, 2019 4:28:33 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 2 ms. row count = 140
Jan 10, 2019 4:28:33 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jan 10, 2019 4:28:33 PM INFO: or

As you can see, the last line is cut off in the middle, and nothing happens after that. None of the counters (records/bytes sent/read) increase. We turned on debug logging on both the TMs and the JM, but the only thing showing up is the heartbeats they send to each other. Do you have any idea what the problem could be, how we could deal with it, or at least how to investigate further? Is there any timeout or configuration option we could try to enable?
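In case it helps with the last question: below are the timeout-related options we are aware of so far, with what we believe are the 1.7 defaults; we have not tuned any of them yet. Would increasing any of these make sense here?

# flink-conf.yaml (values are the defaults, as far as we can tell)
heartbeat.interval: 10000   # ms between heartbeat requests between JM and TMs
heartbeat.timeout: 50000    # ms without a heartbeat before the peer is considered lost
akka.ask.timeout: 10 s      # timeout for blocking Akka/RPC calls
web.timeout: 10000          # ms timeout for asynchronous REST handler operations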