Hi,

This exception looks like it was thrown by a downstream Task/TaskManager
when trying to read a message/packet from some upstream Task/TaskManager
and that connection between two TaskManagers was reseted (closed abruptly).
So it's the case:
> involves communicating with other non-collocated tasks running on other
taskmanagers

Piotrek

wt., 8 gru 2020 o 18:56 Kye Bae <kye....@capitalone.com> napisał(a):

> Hello, Piotr.
>
> Thank you.
>
> This is an error logged to the taskmanager just before it became "lost" to
> the jobmanager (i.e., reported as "lost" in the jobmanager log just before
> the job restart). In what context would this particular error (not the
> root-root cause you referred to) be thrown from a taskmanager? E.g., any
> point in the pipeline that involves communicating with other non-collocated
> tasks running on other taskmanagers? Or with the jobmanager?
>
> -K
>
> On Tue, Dec 8, 2020 at 3:19 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi Kye,
>>
>> Almost for sure this error is not the primary cause of the failure. This
>> error means that the node reporting it, has detected some fatal failure on
>> the other side of the wire (connection reset by peer), but the original
>> error is somehow too slow or unable to propagate to the JobManager before
>> this secondary exception. Something else must have failed/crashed/caused,
>> so you should look for that something. This something can be:
>> 1. TaskManager on the other end has crashed with some error - please look
>> for some errors or warning in other task managers logs
>> 2. OOM or some other JVM failure - again please look at the logs on other
>> machines (maybe system logs)
>> 3. Some OS failure - please look at the system logs on other machines
>> 4. Some hardware failure (restart / crash)
>> 5. Network problems
>>
>> Piotrek
>>
>> pon., 7 gru 2020 o 23:31 Kye Bae <kye....@capitalone.com> napisał(a):
>>
>>> I forgot to mention: this is Flink 1.10.
>>>
>>> -K
>>>
>>> On Mon, Dec 7, 2020 at 5:08 PM Kye Bae <kye....@capitalone.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> We have a real-time streaming workflow that has been running for about
>>>> 2.5 weeks.
>>>>
>>>> Then, we began to get the exception below from taskmanagers (random)
>>>> since yesterday, and the job began to fail/restart every hour or so.
>>>>
>>>> The job does recover after each restart, but sometimes it takes more
>>>> time to recover than allowed in our environment. On a few occasions, it
>>>> took more than a few restarts to fully recover.
>>>>
>>>> Can you provide some insight into what this error means and also what
>>>> we can do to prevent this in future?
>>>>
>>>> Thank you!
>>>>
>>>> +++
>>>> ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue
>>>> - Encountered error while consuming partitions
>>>> java.io
>>>> <https://urldefense.com/v3/__http://java.io/__;!!EFVe01R3CjU!NUoIha4XyuOfu-V-wni1kiKiIyjjXaprElbqdFKZPNj5SkiDttNIjMbEg_LEtbBVlg$>.IOException:
>>>> Connection reset by peer
>>>> at sun.nio.ch
>>>> <https://urldefense.com/v3/__http://sun.nio.ch/__;!!EFVe01R3CjU!NUoIha4XyuOfu-V-wni1kiKiIyjjXaprElbqdFKZPNj5SkiDttNIjMbEg_Lj-CBwHw$>.FileDispatcherImpl.read0(Native
>>>> Method)
>>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>> at sun.nio.ch
>>>> <https://urldefense.com/v3/__http://sun.nio.ch/__;!!EFVe01R3CjU!NUoIha4XyuOfu-V-wni1kiKiIyjjXaprElbqdFKZPNj5SkiDttNIjMbEg_Lj-CBwHw$>
>>>> .IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>> at org.apache.flink.shaded.netty4.io
>>>> <https://urldefense.com/v3/__http://org.apache.flink.shaded.netty4.io/__;!!EFVe01R3CjU!NUoIha4XyuOfu-V-wni1kiKiIyjjXaprElbqdFKZPNj5SkiDttNIjMbEg_KrMQo4YQ$>
>>>> .netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
>>>> at org.apache.flink.shaded.netty4.io
>>>> <https://urldefense.com/v3/__http://org.apache.flink.shaded.netty4.io/__;!!EFVe01R3CjU!NUoIha4XyuOfu-V-wni1kiKiIyjjXaprElbqdFKZPNj5SkiDttNIjMbEg_KrMQo4YQ$>
>>>> .netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1140)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
>>>> at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
>>>> at org.apache.flink.shaded.netty4.io
>>>> <https://urldefense.com/v3/__http://org.apache.flink.shaded.netty4.io/__;!!EFVe01R3CjU!NUoIha4XyuOfu-V-wni1kiKiIyjjXaprElbqdFKZPNj5SkiDttNIjMbEg_KrMQo4YQ$>
>>>> .netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>>>> at org.apache.flink.shaded.netty4.io
>>>> <https://urldefense.com/v3/__http://org.apache.flink.shaded.netty4.io/__;!!EFVe01R3CjU!NUoIha4XyuOfu-V-wni1kiKiIyjjXaprElbqdFKZPNj5SkiDttNIjMbEg_KrMQo4YQ$>
>>>> .netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>> at java.lang.Thread.run(Thread.java:748)
>>>>
>>> ------------------------------
>>>
>>> The information contained in this e-mail is confidential and/or
>>> proprietary to Capital One and/or its affiliates and may only be used
>>> solely in performance of work or services for Capital One. The information
>>> transmitted herewith is intended only for use by the individual or entity
>>> to which it is addressed. If the reader of this message is not the intended
>>> recipient, you are hereby notified that any review, retransmission,
>>> dissemination, distribution, copying or other use of, or taking of any
>>> action in reliance upon this information is strictly prohibited. If you
>>> have received this communication in error, please contact the sender and
>>> delete the material from your computer.
>>>
>>>
>>>
>>>
>>> ------------------------------
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>
>
>
>
>

Reply via email to