Hi everyone: I met this Exception when a hard disk was damaged: https://issues.apache.org/jira/secure/attachment/13009035/13009035_flink_disk_error.png
I checked the code and found that flink will create a temp file when Record length > 5 MB: // SpillingAdaptiveSpanningRecordDeserializer.java if (nextRecordLength > THRESHOLD_FOR_SPILLING) { // create a spilling channel and put the data there this.spillingChannel = createSpillingChannel(); ByteBuffer toWrite = partial.segment.wrap(partial.position, numBytesChunk); FileUtils.writeCompletely(this.spillingChannel, toWrite); } The tempDir is random picked from all `tempDirs`. Well on yarn mode, one `tempDir` usually represents one hard disk. In may opinion, if a hard disk is damaged, taskmanager should pick another disk(tmpDir) for Spilling Channel, rather than throw an IOException, which causes flink job restart over and over again. I have created a jira issue ( https://issues.apache.org/jira/browse/FLINK-18811 ) to track this. And I'm looking forward someone could help review the code or discuss about this issue. thanks!