[ https://issues.apache.org/jira/browse/FLINK-18811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171367#comment-17171367 ]
Kai Chen edited comment on FLINK-18811 at 8/5/20, 9:22 AM: ----------------------------------------------------------- I made some test today, and the change below looks good to me: {code:java} // SpillingAdaptiveSpanningRecordDeserializer.java private FileChannel createSpillingChannel() throws IOException { if (spillFile != null) { throw new IllegalStateException("Spilling file already exists."); } // try to find a unique file name for the spilling channel int maxAttempts = 10; String[] tempDirs = this.tempDirs; for (int attempt = 0; attempt < maxAttempts; attempt++) { int dirIndex = rnd.nextInt(tempDirs.length); String directory = tempDirs[dirIndex]; spillFile = new File(directory, randomString(rnd) + ".inputchannel"); try { if (spillFile.createNewFile()) { return new RandomAccessFile(spillFile, "rw").getChannel(); } } catch (IOException e) { // if there is no tempDir left to try if(tempDirs.length <= 1) { throw e; } LOG.warn("Caught an IOException when creating spill file: " + directory + ". Attempt " + attempt, e); tempDirs = (String[])ArrayUtils.remove(tempDirs,dirIndex); } } {code} Does anyone have any good idea? was (Author: yuchuanchen): I made some test today, and the change below looks good to me: {code:java} // SpillingAdaptiveSpanningRecordDeserializer.java private FileChannel createSpillingChannel() throws IOException { if (spillFile != null) { throw new IllegalStateException("Spilling file already exists."); } // try to find a unique file name for the spilling channel int maxAttempts = 10; String[] tempDirs = this.tempDirs; for (int attempt = 0; attempt < maxAttempts; attempt++) { int dirIndex = rnd.nextInt(tempDirs.length); String directory = tempDirs[dirIndex]; spillFile = new File(directory, randomString(rnd) + ".inputchannel"); try { if (spillFile.createNewFile()) { return new RandomAccessFile(spillFile, "rw").getChannel(); } } catch (IOException e) { // if there is no tempDir left to try if(tempDirs.length <= 1) { throw e; } LOG.warn("Caught an IOException when creating spill file: " + directory + ". Attempt " + attempt, e); tempDirs = (String[])ArrayUtils.remove(tempDirs,dirIndex); } } {code} Does anyone have any good ideas? > if a disk is damaged, taskmanager should choose another disk for temp dir , > rather than throw an IOException, which causes flink job restart over and > over again > ---------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: FLINK-18811 > URL: https://issues.apache.org/jira/browse/FLINK-18811 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network > Environment: flink-1.10 > Reporter: Kai Chen > Priority: Major > Attachments: flink_disk_error.png > > > I met this Exception when a disk was damaged: > !flink_disk_error.png! > I think, if a disk is damaged, taskmanager should choose another disk for > temp dir , rather than throw an IOException, which causes flink job restart > over and over again. > If we could just change “SpillingAdaptiveSpanningRecordDeserializer" like > this: > {code:java} > // SpillingAdaptiveSpanningRecordDeserializer.java > private FileChannel createSpillingChannel() throws IOException { > if (spillFile != null) { > throw new IllegalStateException("Spilling file already exists."); > } > // try to find a unique file name for the spilling channel > int maxAttempts = 10; > String[] tempDirs = this.tempDirs; > for (int attempt = 0; attempt < maxAttempts; attempt++) { > int dirIndex = rnd.nextInt(tempDirs.length); > String directory = tempDirs[dirIndex]; > spillFile = new File(directory, randomString(rnd) + ".inputchannel"); > try { > if (spillFile.createNewFile()) { > return new RandomAccessFile(spillFile, "rw").getChannel(); > } > } catch (IOException e) { > // if there is no tempDir left to try > if(tempDirs.length <= 1) { > throw e; > } > LOG.warn("Caught an IOException when creating spill file: " + > directory + ". Attempt " + attempt, e); > tempDirs = (String[])ArrayUtils.remove(tempDirs,dirIndex); > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)