Re: flink job will restart over and over again if a taskmanager's disk damages

Timo Walther Thu, 22 Oct 2020 00:58:59 -0700

Hi,

thanks for letting us know about this shortcoming.

I will link someone from the runtime team in the JIRA issue. Let's continue the discussion there.


Regards,
Timo

On 22.10.20 05:36, chenkaibit wrote:

Hi everyone:
  I ran into this exception when a hard disk was damaged:
https://issues.apache.org/jira/secure/attachment/13009035/13009035_flink_disk_error.png
I checked the code and found that Flink creates a temp file when the record length > 5 MB:

// SpillingAdaptiveSpanningRecordDeserializer.java
if (nextRecordLength > THRESHOLD_FOR_SPILLING) {
    // create a spilling channel and put the data there
    this.spillingChannel = createSpillingChannel();

    ByteBuffer toWrite = partial.segment.wrap(partial.position, numBytesChunk);
    FileUtils.writeCompletely(this.spillingChannel, toWrite);
}

The tempDir is picked at random from all `tempDirs`. In YARN mode, one `tempDir` usually represents one hard disk. In my opinion, if a hard disk is damaged, the taskmanager should pick another disk (tmpDir) for the spilling channel rather than throw an IOException, which causes the Flink job to restart over and over again.
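
To make the proposal concrete, here is a minimal sketch of the fallback idea. It is not Flink's actual code: the class `SpillFileFallback`, the method `createSpillFile`, and the `.spill` file suffix are all hypothetical names chosen for illustration. It only assumes that the configured temp directories are available as a list, tries them in random order, and moves on to the next one when file creation fails.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.UUID;

// Hypothetical illustration only; none of these names exist in Flink.
class SpillFileFallback {

    /**
     * Tries the given temp directories in random order and returns the first
     * spill file that could be created. Throws only if every directory fails.
     */
    static Path createSpillFile(List<Path> tempDirs, Random rnd) throws IOException {
        List<Path> candidates = new ArrayList<>(tempDirs);
        Collections.shuffle(candidates, rnd); // keep the random spreading across disks

        IOException lastFailure = null;
        for (Path dir : candidates) {
            Path file = dir.resolve(UUID.randomUUID() + ".spill");
            try (FileChannel ch = FileChannel.open(
                    file, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
                return file; // this directory is usable
            } catch (IOException e) {
                lastFailure = e; // e.g. damaged or missing disk: try the next directory
            }
        }

        IOException ex = new IOException("Could not create a spill file in any of " + tempDirs);
        if (lastFailure != null) {
            ex.addSuppressed(lastFailure);
        }
        throw ex;
    }

    public static void main(String[] args) throws IOException {
        // Demo: the first directory is assumed not to exist, the second is the JVM temp dir.
        List<Path> dirs = new ArrayList<>();
        dirs.add(Paths.get(System.getProperty("java.io.tmpdir"), "no-such-disk-" + UUID.randomUUID()));
        dirs.add(Paths.get(System.getProperty("java.io.tmpdir")));
        System.out.println("Spill file created at: " + createSpillFile(dirs, new Random()));
    }
}
```

With this shape, a single broken directory only costs one failed attempt per spill; the job fails only when no directory is writable. Note that the sketch checks file creation only, so a disk that fails later during writes would need separate handling.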
I have created a JIRA issue (https://issues.apache.org/jira/browse/FLINK-18811) to track this, and I'm looking forward to someone helping review the code or discussing this issue.
Thanks!
