[ https://issues.apache.org/jira/browse/FLINK-7590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bowen Li closed FLINK-7590. --------------------------- Resolution: Won't Fix > Flink failed to flush and close the file system output stream for > checkpointing because of s3 read timeout > ---------------------------------------------------------------------------------------------------------- > > Key: FLINK-7590 > URL: https://issues.apache.org/jira/browse/FLINK-7590 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing > Affects Versions: 1.3.2 > Reporter: Bowen Li > Fix For: 1.4.0, 1.3.3 > > > Flink job failed once over the weekend because of the following issue. It > picked itself up afterwards and has been running well. But the issue might > worth taking a look at. > {code:java} > 2017-09-03 13:18:38,998 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - reduce > (14/18) (c97256badc87e995d456e7a13cec5de9) switched from RUNNING to FAILED. > AsynchronousException{java.lang.Exception: Could not materialize checkpoint > 163 for operator reduce (14/18).} > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.Exception: Could not materialize checkpoint 163 for > operator reduce (14/18). > ... 6 more > Caused by: java.util.concurrent.ExecutionException: java.io.IOException: > Could not flush and close the file system output stream to > s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the > stream state handle > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897) > ... 5 more > Suppressed: java.lang.Exception: Could not properly cancel managed > keyed state future. > at > org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961) > ... 5 more > Caused by: java.util.concurrent.ExecutionException: > java.io.IOException: Could not flush and close the file system output stream > to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain > the stream state handle > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43) > at > org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85) > at > org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88) > ... 7 more > Caused by: java.io.IOException: Could not flush and close the file > system output stream to s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa > in order to obtain the stream state handle > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:336) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeSnapshotStreamAndGetHandle(RocksDBKeyedStateBackend.java:693) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeCheckpointStream(RocksDBKeyedStateBackend.java:531) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:420) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:399) > at > org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897) > ... 5 more > Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall > response (Failed to parse XML document with handler class > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler). > Response Code: 200, Response Text: OK > at > com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738) > at > com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399) > at > com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) > at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) > at > com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2524) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.completeMultipartUpload(UploadMonitor.java:236) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.poll(UploadMonitor.java:183) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:152) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:50) > ... 4 more > Caused by: com.amazonaws.AmazonClientException: Failed to parse XML > document with handler class > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler > at > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150) > at > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseCompleteMultipartUploadResponse(XmlResponsesSaxParser.java:425) > at > com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:200) > at > com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:197) > at > com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62) > at > com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:44) > at > com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:30) > at > com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712) > ... 12 more > Caused by: java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at > java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) > at sun.security.ssl.InputRecord.read(InputRecord.java:503) > at > sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983) > at > sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940) > at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) > at > org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) > at > org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) > at > org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) > at > org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:266) > at > org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227) > at > org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186) > at > org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.read1(BufferedReader.java:212) > at java.io.BufferedReader.read(BufferedReader.java:286) > at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) > at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown > Source) > at > org.apache.xerces.impl.XMLDocumentScannerImpl$TrailingMiscDispatcher.dispatch(Unknown > Source) > at > org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) > at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown > Source) > at > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141) > ... 19 more > [CIRCULAR REFERENCE:java.io.IOException: Could not flush and close the > file system output stream to > s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the > stream state handle] > 2017-09-03 13:18:39,000 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job > com.offerup.stream_processing.item_view_stats.ItemViewStatsStreamingApp > (aac822203a47d504ecd9b73a77c60cd5) switched from state RUNNING to FAILING. > AsynchronousException{java.lang.Exception: Could not materialize checkpoint > 163 for operator reduce (14/18).} > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.Exception: Could not materialize checkpoint 163 for > operator reduce (14/18). > ... 6 more > Caused by: java.util.concurrent.ExecutionException: java.io.IOException: > Could not flush and close the file system output stream to > s3://xxx/aac822203a47d504ecd9b73a77c60cd5/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa > in order to obtain the stream state handle > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897) > ... 5 more > Suppressed: java.lang.Exception: Could not properly cancel managed > keyed state future. > at > org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961) > ... 5 more > Caused by: java.util.concurrent.ExecutionException: > java.io.IOException: Could not flush and close the file system output stream > to s3://xxxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain > the stream state handle > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43) > at > org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85) > at > org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88) > ... 7 more > Caused by: java.io.IOException: Could not flush and close the file > system output stream to > s3://xxx/aac822203a47d504ecd9b73a77c60cd5/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa > in order to obtain the stream state handle > at > org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:336) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeSnapshotStreamAndGetHandle(RocksDBKeyedStateBackend.java:693) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBFullSnapshotOperation.closeCheckpointStream(RocksDBKeyedStateBackend.java:531) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:420) > at > org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$3.performOperation(RocksDBKeyedStateBackend.java:399) > at > org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897) > ... 5 more > Caused by: com.amazonaws.AmazonClientException: Unable to unmarshall > response (Failed to parse XML document with handler class > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler). > Response Code: 200, Response Text: OK > at > com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738) > at > com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399) > at > com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) > at > com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) > at > com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:2524) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.completeMultipartUpload(UploadMonitor.java:236) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.poll(UploadMonitor.java:183) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:152) > at > com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:50) > ... 4 more > Caused by: com.amazonaws.AmazonClientException: Failed to parse XML > document with handler class > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler > at > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150) > at > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseCompleteMultipartUploadResponse(XmlResponsesSaxParser.java:425) > at > com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:200) > at > com.amazonaws.services.s3.model.transform.Unmarshallers$CompleteMultipartUploadResultUnmarshaller.unmarshall(Unmarshallers.java:197) > at > com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62) > at > com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:44) > at > com.amazonaws.services.s3.internal.ResponseHeaderHandlerChain.handle(ResponseHeaderHandlerChain.java:30) > at > com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712) > ... 12 more > Caused by: java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at > java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) > at sun.security.ssl.InputRecord.read(InputRecord.java:503) > at > sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983) > at > sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940) > at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) > at > org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) > at > org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) > at > org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) > at > org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:266) > at > org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227) > at > org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186) > at > org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.read1(BufferedReader.java:212) > at java.io.BufferedReader.read(BufferedReader.java:286) > at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source) > at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown > Source) > at > org.apache.xerces.impl.XMLDocumentScannerImpl$TrailingMiscDispatcher.dispatch(Unknown > Source) > at > org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) > at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown > Source) > at > com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141) > ... 19 more > [CIRCULAR REFERENCE:java.io.IOException: Could not flush and close the > file system output stream to > s3://xxx/chk-163/dcb9e1df-78e0-444a-9646-7701b25c1aaa in order to obtain the > stream state handle] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)