Flink job restarts over and over again if a TaskManager's disk is damaged

chenkaibit Wed, 21 Oct 2020 20:37:58 -0700

Hi everyone,
I hit this exception when a hard disk was damaged:
https://issues.apache.org/jira/secure/attachment/13009035/13009035_flink_disk_error.png



I checked the code and found that Flink creates a temporary spill file when 
the record length exceeds 5 MB:
// SpillingAdaptiveSpanningRecordDeserializer.java
if (nextRecordLength > THRESHOLD_FOR_SPILLING) {
   // create a spilling channel and put the data there
   this.spillingChannel = createSpillingChannel();

   ByteBuffer toWrite = partial.segment.wrap(partial.position, numBytesChunk);
   FileUtils.writeCompletely(this.spillingChannel, toWrite);
}


The temp directory is picked at random from all `tempDirs`. In YARN mode, each 
`tempDir` usually corresponds to one hard disk.
In my opinion, if a hard disk is damaged, the TaskManager should pick another 
disk (`tempDir`) for the spilling channel rather than throw an IOException, 
which causes the Flink job to restart over and over again.
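
The fallback proposed above could look roughly like the sketch below. It is 
only an illustration, not Flink's actual code and not necessarily the patch 
attached to the Jira issue: the class `SpillFileFallback`, the holder 
`SpillTarget`, and the `.inputchannel` file suffix are hypothetical names. The 
idea is to try the configured temp directories in random order and to fail 
only if none of them accepts a new file.

```java
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical helper, not part of Flink.
final class SpillFileFallback {

    /** Holds the spill file together with the open channel that writes to it. */
    static final class SpillTarget {
        public final File file;
        public final FileChannel channel;

        SpillTarget(File file, FileChannel channel) {
            this.file = file;
            this.channel = channel;
        }
    }

    /**
     * Tries every temp directory in random order and returns a channel for the
     * first directory in which a new file can be created. An IOException is
     * thrown only after all directories have failed.
     */
    public static SpillTarget createSpillingChannel(String[] tempDirs) throws IOException {
        // Shuffle a copy so the load is still spread across disks.
        List<String> candidates = new ArrayList<>(Arrays.asList(tempDirs));
        Collections.shuffle(candidates);

        IOException lastFailure = null;
        for (String dir : candidates) {
            // Assumed naming scheme: a random file name per spill file.
            File file = new File(dir, UUID.randomUUID() + ".inputchannel");
            try {
                FileChannel channel = FileChannel.open(
                        file.toPath(),
                        StandardOpenOption.CREATE_NEW,
                        StandardOpenOption.WRITE);
                return new SpillTarget(file, channel);
            } catch (IOException e) {
                // This directory (disk) is unusable; remember why and try the next one.
                lastFailure = e;
            }
        }
        throw new IOException(
                "Could not create a spill file in any of the "
                        + candidates.size() + " temp directories",
                lastFailure);
    }
}
```

Note that this only covers failures when the file is created. A disk that 
fails later, during `FileUtils.writeCompletely`, would need separate handling.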


I have created a Jira issue (https://issues.apache.org/jira/browse/FLINK-18811) 
to track this. I would appreciate it if someone could help review the code 
or discuss this issue.
Thanks!
