I'm trying to understand whether the behaviour of the Flink JobManager
during a ZooKeeper upgrade is expected.

I'm running Flink 1.11.2 in Kubernetes, with ZooKeeper server 3.5.4-beta.
While I'm upgrading ZooKeeper, there is a 20-second ZooKeeper downtime.
I'd expect either the Flink job to restart or a few warnings in the logs
during those 20 seconds. Instead, I see the whole Flink JVM crash (and later
the pod restart).

I expected Flink to retry ZooKeeper requests internally, so I'm surprised
that it crashes. Is this expected, or is it a bug?

From the logs:

org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:00.197 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:00.197 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Socket connection established to zdzk.servicexxx/192.168.190.92:2181,
initiating session
[09-Feb-2021 11:30:00.198 UTC] WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181,
unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
~[?:1.8.0_192]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:02.294 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:02.295 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Socket connection established to zdzk.servicexxx/192.168.190.92:2181,
initiating session
[09-Feb-2021 11:30:02.295 UTC] WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181,
unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
~[?:1.8.0_192]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:03.841 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:03.842 UTC] INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Socket connection established to zdzk.servicexxx/192.168.190.92:2181,
initiating session
[09-Feb-2021 11:30:03.842 UTC] WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181,
unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.rea



FYI: I've asked the same question on Stack Overflow:
https://stackoverflow.com/questions/66120905/should-flink-job-manager-crash-during-zookeeper-upgrade



