Hi,

This exception looks like it was thrown by a downstream Task/TaskManager while trying to read a message/packet from an upstream Task/TaskManager, after the connection between the two TaskManagers was reset (closed abruptly). So yes, it is the case you described:

> involves communicating with other non-collocated tasks running on other taskmanagers

Piotrek

On Tue, 8 Dec 2020 at 18:56, Kye Bae <kye....@capitalone.com> wrote:

> Hello, Piotr.
>
> Thank you.
>
> This is an error logged to the taskmanager just before it became "lost" to
> the jobmanager (i.e., reported as "lost" in the jobmanager log just before
> the job restart). In what context would this particular error (not the
> root-root cause you referred to) be thrown from a taskmanager? E.g., any
> point in the pipeline that involves communicating with other non-collocated
> tasks running on other taskmanagers? Or with the jobmanager?
>
> -K
>
> On Tue, Dec 8, 2020 at 3:19 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi Kye,
>>
>> Almost certainly, this error is not the primary cause of the failure. This
>> error means that the node reporting it has detected some fatal failure on
>> the other side of the wire (connection reset by peer), but the original
>> error is somehow too slow or unable to propagate to the JobManager before
>> this secondary exception. Something else must have failed or crashed,
>> so you should look for that something. This something can be:
>> 1. TaskManager on the other end has crashed with some error - please look
>> for errors or warnings in the other TaskManagers' logs
>> 2. OOM or some other JVM failure - again please look at the logs on other
>> machines (maybe system logs)
>> 3. Some OS failure - please look at the system logs on other machines
>> 4. Some hardware failure (restart / crash)
>> 5. Network problems
>>
>> Piotrek
>>
>> On Mon, 7 Dec 2020 at 23:31, Kye Bae <kye....@capitalone.com> wrote:
>>
>>> I forgot to mention: this is Flink 1.10.
>>>
>>> -K
>>>
>>> On Mon, Dec 7, 2020 at 5:08 PM Kye Bae <kye....@capitalone.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> We have a real-time streaming workflow that has been running for about
>>>> 2.5 weeks.
>>>>
>>>> Since yesterday, we have been getting the exception below from random
>>>> taskmanagers, and the job has begun to fail/restart every hour or so.
>>>>
>>>> The job does recover after each restart, but sometimes it takes more
>>>> time to recover than allowed in our environment. On a few occasions, it
>>>> took more than a few restarts to fully recover.
>>>>
>>>> Can you provide some insight into what this error means and also what
>>>> we can do to prevent this in future?
>>>>
>>>> Thank you!
>>>>
>>>> +++
>>>> ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
>>>> java.io.IOException: Connection reset by peer
>>>>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>     at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
>>>>     at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1140)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>>>>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>
>>> ------------------------------
>>>
>>> The information contained in this e-mail is confidential and/or
>>> proprietary to Capital One and/or its affiliates and may only be used
>>> solely in performance of work or services for Capital One. The information
>>> transmitted herewith is intended only for use by the individual or entity
>>> to which it is addressed. If the reader of this message is not the intended
>>> recipient, you are hereby notified that any review, retransmission,
>>> dissemination, distribution, copying or other use of, or taking of any
>>> action in reliance upon this information is strictly prohibited. If you
>>> have received this communication in error, please contact the sender and
>>> delete the material from your computer.
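
Following up on the advice in this thread to look in the other TaskManagers' logs for whatever failed first: below is a rough sketch of how that search could be scripted. It is not part of Flink. The function names, the host labels, and the 5-minute window are all made up for illustration, and the regular expression assumes log lines that start with a "yyyy-MM-dd HH:mm:ss,SSS LEVEL" prefix, which may not match your log4j configuration. It collects WARN/ERROR/FATAL lines from every TaskManager log, finds the first "Encountered error while consuming partitions" entry, and lists what was logged on any host shortly before it.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "2020-12-07 17:08:05,000 ERROR some.Logger - message".
# Adjust the pattern if your logging configuration differs.
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(WARN|ERROR|FATAL)\s+(.*)$"
)

# Message of the secondary error reported in this thread.
MARKER = "Encountered error while consuming partitions"


def parse_line(line):
    """Return (timestamp, level, text) for WARN/ERROR/FATAL lines, else None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return stamp, match.group(2), match.group(3)


def find_suspects(logs_by_host, window=timedelta(minutes=5)):
    """List entries logged within `window` before the first marker entry.

    logs_by_host maps a host label (hypothetical, e.g. "tm-1") to an
    iterable of log lines. Returns (timestamp, host, level, text) tuples
    sorted by time, excluding the marker entries themselves.
    """
    entries = []
    for host, lines in logs_by_host.items():
        for line in lines:
            parsed = parse_line(line.rstrip("\n"))
            if parsed is not None:
                stamp, level, text = parsed
                entries.append((stamp, host, level, text))
    entries.sort(key=lambda entry: entry[0])

    reset_times = [entry[0] for entry in entries if MARKER in entry[3]]
    if not reset_times:
        return []  # no marker found, nothing to correlate against
    first_reset = reset_times[0]
    return [
        entry
        for entry in entries
        if first_reset - window <= entry[0] <= first_reset
        and MARKER not in entry[3]
    ]
```

To try it, pass something like {"tm-1": open("tm-1.log"), "tm-2": open("tm-2.log")} (file names hypothetical). Anything it returns is only a candidate for the primary failure. JVM crashes, OOM kills, and OS or hardware problems may show up only in system logs, so an empty result does not rule them out.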