Re: flink list and flink run commands timeout

Gary Yao Wed, 05 Sep 2018 22:15:14 -0700

Hi Jason,

From the stack trace, it seems that you are using the 1.4.0 client to list
jobs on a 1.5.x cluster. This will not work; you have to use the 1.5.x client.
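
As a rough illustration of the rule above (the client should be from the same
1.5.x line as the cluster), here is a minimal Python sketch. The helper names
major_minor and client_matches_cluster are hypothetical and not part of Flink.
The version strings are assumed to be supplied by hand, for example from the
output of the client's --version option and from the version shown in the web
UI.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.5.3'."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def client_matches_cluster(client_version: str, cluster_version: str) -> bool:
    """True when client and cluster share the same major.minor line."""
    return major_minor(client_version) == major_minor(cluster_version)


if __name__ == "__main__":
    # The combination from this thread: 1.4.0 client against a 1.5.3 cluster.
    print(client_matches_cluster("1.4.0", "1.5.3"))
```

For the combination in this thread the check is negative, which is consistent
with the client timing out rather than getting a response.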


Best,
Gary

On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania <jason.ka...@ymail.com> wrote:

> Hello,
>
> Thanks for the response. I had already tried setting the log level to
> debug in log4j-cli.properties, logback-console.xml, and
> log4j-console.properties, but no additional relevant information comes out.
> On the server, all that comes out are ZooKeeper ping responses:
>
> 2018-09-05 15:16:56,786 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - Got ping response for
> sessionid: 0x3659b60bcb50076 after 1ms
>
> The client log indicates only the following (but we are not using hadoop):
>
> 2018-09-05 15:19:53,339 WARN  org.apache.flink.client.cli.CliFrontend
>                    - Could not load CLI class org.apache.flink.yarn.cli.
> FlinkYarnSessionCli.
> java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:264)
>         at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(
> CliFrontend.java:1208)
>         at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(
> CliFrontend.java:1164)
>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.
> java:1090)
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.
> Configuration
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         ... 5 more
>
>
> and
>
> 2018-09-05 15:19:53,881 ERROR org.apache.flink.shaded.
> curator.org.apache.curator.ConnectionState  - Authentication failed
>
>
> despite the zookeeper being configured as 'open' and latest logs showing
> data being read from zookeeper.
>
> 2018-09-05 15:19:54,274 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - Reading reply
> sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null
> finished:false header:: 1,3  replyHeader:: 1,47244656277,0  request::
> '/flink,F  response:: s{47244656196,47244656196,
> 1536110417531,1536110417531,0,1,0,0,0,1,47244656197}
>
>
> Much like the basic log output, the detailed trace shows no additional
> information, just a gap after waiting for the response:
>
> 2018-09-05 15:19:54,313 INFO  org.apache.flink.client.cli.CliFrontend
>                    - Waiting for response...
> 2018-09-05 15:20:07,635 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - Got ping response for
> sessionid: 0x265a12437df0074 after 1ms
> 2018-09-05 15:20:20,976 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - Got ping response for
> sessionid: 0x265a12437df0074 after 1ms
> 2018-09-05 15:20:24,311 INFO  org.apache.flink.runtime.rest.RestClient
>                   - Shutting down rest endpoint.
> 2018-09-05 15:20:24,317 INFO  org.apache.flink.runtime.rest.RestClient
>                   - Rest endpoint shutdown complete.
> 2018-09-05 15:20:24,318 INFO  org.apache.flink.runtime.leaderretrieval.
> ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService
> /leader/rest_server_lock.
> 2018-09-05 15:20:24,320 INFO  org.apache.flink.runtime.leaderretrieval.
> ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService
> /leader/dispatcher_lock.
> 2018-09-05 15:20:24,320 DEBUG org.apache.flink.shaded.
> curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Closing
> 2018-09-05 15:20:24,321 INFO  org.apache.flink.shaded.
> curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  -
> backgroundOperationsLoop exiting
> 2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.
> curator.org.apache.curator.CuratorZookeeperClient  - Closing
> 2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.
> curator.org.apache.curator.ConnectionState  - Closing
> 2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ZooKeeper  - Closing session:
> 0x265a12437df0074
> 2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - Closing client for session:
> 0x265a12437df0074
> 2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - Reading reply
> sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null
> finished:false header:: 11,-11  replyHeader:: 11,47244656278,0  request::
> null response:: null
> 2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - Disconnecting client for
> session: 0x265a12437df0074
> 2018-09-05 15:20:24,330 INFO  org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x265a12437df0074
> closed
> 2018-09-05 15:20:24,330 INFO  org.apache.flink.shaded.
> zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for
> session: 0x265a12437df0074
> 2018-09-05 15:20:24,330 ERROR org.apache.flink.client.cli.CliFrontend
>                    - Error while running the command.
>
>
>
>
> On Wednesday, September 5, 2018, 3:41:29 a.m. EDT, Chesnay Schepler <
> ches...@apache.org> wrote:
>
>
> Please enable DEBUG logging for the client and TRACE logging for the
> cluster.
>
> For the client, look for log messages starting with "Sending request of";
> these contain the host and port that the client sends requests to. Verify
> that they are correct and match the host/port that you use when accessing
> the web UI.
>
> For the server, look for log messages starting with "Received request",
> so we can figure out whether the request at least arrives.
>
> On 05.09.2018 00:53, Jason Kania wrote:
>
> I have upgraded from Flink 1.4.0 to Flink 1.5.3 with a three-node cluster
> configured with HA. Now I am encountering an issue where the flink command
> line operations time out. The exception generated is not helpful because it
> only indicates a timeout and not the reason or what it was trying to do:
>
> >./flink list -f
> Waiting for response...
>
> ------------------------------------------------------------
>  The program finished with the following exception:
> org.apache.flink.util.FlinkException: Failed to retrieve job list.
>         at org.apache.flink.client.cli.CliFrontend.listJobs(
> CliFrontend.java:433)
>         at org.apache.flink.client.cli.CliFrontend.lambda$list$0(
> CliFrontend.java:416)
>         at org.apache.flink.client.cli.CliFrontend.runClusterAction(
> CliFrontend.java:960)
>         at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.
> java:413)
>         at org.apache.flink.client.cli.CliFrontend.parseParameters(
> CliFrontend.java:1028)
>         at org.apache.flink.client.cli.CliFrontend.lambda$main$9(
> CliFrontend.java:1101)
>         at org.apache.flink.runtime.security.NoOpSecurityContext.
> runSecured(NoOpSecurityContext.java:30)
>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.
> java:1101)
> Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException:
> Could not complete the operation. Exception is not retryable.
>         at org.apache.flink.runtime.concurrent.FutureUtils.lambda$
> retryOperationWithDelay$5(FutureUtils.java:213)
>         at java.util.concurrent.CompletableFuture.uniWhenComplete(
> CompletableFuture.java:760)
>         at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(
> CompletableFuture.java:736)
>         at java.util.concurrent.CompletableFuture.postComplete(
> CompletableFuture.java:474)
>         at java.util.concurrent.CompletableFuture.completeExceptionally(
> CompletableFuture.java:1977)
>         at org.apache.flink.runtime.concurrent.FutureUtils$
> Timeout.run(FutureUtils.java:793)
>         at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$
> ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$
> ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.CompletionException: java.util.concurrent.
> TimeoutException
>         at java.util.concurrent.CompletableFuture.encodeThrowable(
> CompletableFuture.java:292)
>         at java.util.concurrent.CompletableFuture.completeThrowable(
> CompletableFuture.java:308)
>         at java.util.concurrent.CompletableFuture.uniApply(
> CompletableFuture.java:593)
>         at java.util.concurrent.CompletableFuture$UniApply.
> tryFire(CompletableFuture.java:577)
>         ... 10 more
> Caused by: java.util.concurrent.TimeoutException
>
> The web interface shows the 2 job managers and 3 task managers that are
> talking with one another.
>
> I have looked at the zookeeper data and it is all present.
>
> I have tried running the command on multiple nodes and they all give the
> same error.
>
> I looked for a verbose or debug option for the commands but found nothing.
>
> Suggestions on this?
>
> Thanks,
>
> Jason
>
>
>
