Is this similar to an existing bug where interpreter processes get stuck? (The workaround there is to kill the application on yarn, restart the interpreter from the interface, and then resubmit the query.) The problem in this case is that it happens intermittently, on seemingly random spark interpreters. And since the driver app is never scheduled on yarn, there are no logs available to figure out the reason for this issue.
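For reference, here is a rough sketch of the first step of the workaround mentioned above (killing the stuck application on yarn). It assumes the `yarn` CLI is on the PATH of the Zeppelin host, and the helper names (`extract_app_id`, `kill_command`, `kill_stuck_app`) are made up for illustration; they are not part of Zeppelin or Hadoop.

```python
import re
import subprocess
from typing import List, Optional

# YARN application ids look like application_<cluster timestamp>_<sequence>.
APP_ID_RE = re.compile(r"application_\d+_\d+")


def extract_app_id(log_text: str) -> Optional[str]:
    """Return the first YARN application id found in the log text, if any."""
    match = APP_ID_RE.search(log_text)
    return match.group(0) if match else None


def kill_command(app_id: str) -> List[str]:
    """Build the `yarn application -kill <id>` command line."""
    return ["yarn", "application", "-kill", app_id]


def kill_stuck_app(log_text: str, dry_run: bool = True) -> Optional[List[str]]:
    """Hypothetical helper: kill the application named in the interpreter log.

    With dry_run=True (the default) the command is only returned, not executed.
    """
    app_id = extract_app_id(log_text)
    if app_id is None:
        return None
    cmd = kill_command(app_id)
    if not dry_run:
        # Raises CalledProcessError if the yarn CLI exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd
```

After the kill succeeds, the interpreter still has to be restarted from the interface and the query resubmitted by hand; this sketch does not cover those steps.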
Thanks and Regards

*Sarthak Sharma*
DevOps Engineer, Media.Net
+918002228376 | sarthak...@media.net
<http://en-gb.facebook.com/people/Sarthak-Sharma/100006006014244>
<http://in.linkedin.com/in/sarthaksharma96>

On Tue, Nov 20, 2018 at 2:22 PM Jeff Zhang <zjf...@gmail.com> wrote:

> If *zeppelin.interpreter.connect.timeout* is reached but the yarn app is
> still in ACCEPTED state, then this should be a bug. The yarn app should be
> killed if it cannot be created within the timeout threshold.
>
> Sarthak Sharma <sarthak...@media.net> wrote on Tue, Nov 20, 2018 at 4:47 PM:
>
>> Hey,
>>
>> Like you mentioned, I'm already using the *spark.yarn.queue* parameter,
>> so I know which yarn queue it is getting scheduled in, and this queue has
>> resources available for applications, since other apps are also getting
>> scheduled there.
>> However, even assuming the queue does NOT have the resources to schedule it
>> within the given time frame, so that an exception is thrown once
>> *zeppelin.interpreter.connect.timeout* is reached, the application should
>> in any case get scheduled eventually, which is not the case here. The
>> interpreter driver process remains stuck in ACCEPTED state. Has the
>> implementation changed in this version? We never experienced this on the
>> previous one (zeppelin-0.7.3), where drivers would eventually get scheduled
>> in their respective queues.
>>
>> On Tue, Nov 20, 2018, 7:29 AM Xun Liu <neliu...@163.com> wrote:
>>
>>> Hi, Sarthak Sharma
>>>
>>> The log shows that the task submitted by spark-submit has been waiting
>>> for execution in the YARN queue. Does the YARN queue have no resources
>>> available?
>>> You can specify a queue with resources in the spark interpreter via the
>>> spark.yarn.queue parameter.
>>>
>>>
>>> On Nov 19, 2018, at 7:41 PM, Sarthak Sharma <sarthak...@media.net> wrote:
>>>
>>> Hi,
>>>
>>> We already have a zeppelin-0.7.3 setup which runs fine and is currently in
>>> use, but we are looking into the yarn cluster mode support for the spark
>>> interpreter in zeppelin-0.8. I've built it from source from *branch-0.8
>>> (as of Nov-15)* and am intermittently facing the following issue in
>>> some of the spark interpreters while trying to use spark-sql on it.
>>>
>>> *18/11/19 10:04:07 INFO yarn.Client: Submitting application application_1542587655772_35129 to ResourceManager*
>>> *18/11/19 10:04:07 INFO impl.YarnClientImpl: Submitted application application_1542587655772_35129*
>>> *18/11/19 10:04:08 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:08 INFO yarn.Client:*
>>> * client token: N/A*
>>> * diagnostics: N/A*
>>> * ApplicationMaster host: N/A*
>>> * ApplicationMaster RPC port: -1*
>>> * queue: root.zep*
>>> * start time: 1542621847537*
>>> * final status: UNDEFINED*
>>> * tracking URL: http://resource-manager-addr/proxy/application_1542587655772_35129/ <http://c8-auto-hadoop-service-1.srv.media.net:8088/proxy/application_1542587655772_35129/>*
>>> * user: sarthak.sh*
>>> *18/11/19 10:04:09 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:10 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:11 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:12 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:13 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:14 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:15 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:16 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:17 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:18 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:19 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:20 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:21 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:22 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:23 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:24 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:25 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:26 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:27 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:28 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:29 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:30 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:31 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:32 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:33 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:34 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:35 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:36 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:37 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:38 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:39 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:40 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:41 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:42 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:43 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:44 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:45 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:46 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:47 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:48 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:49 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:50 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:51 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:52 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:53 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:54 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:55 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:56 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:57 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:58 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:04:59 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:00 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:01 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:02 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:03 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:04 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:05 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:06 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:07 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:08 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:09 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:10 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:11 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:12 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:13 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:14 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:15 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:16 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:17 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:18 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:19 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:20 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:21 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:22 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:23 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:24 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:25 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:26 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:27 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:28 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:29 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:30 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:31 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:32 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:33 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:34 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:35 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:36 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:37 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:38 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:39 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:40 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:41 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:42 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:43 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:44 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:45 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:46 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:47 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:48 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:49 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:50 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:51 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:52 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:53 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:54 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:55 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:56 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:57 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:58 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>> *18/11/19 10:05:59 INFO yarn.Client: Application report for application_1542587655772_35129 (state: ACCEPTED)*
>>>
>>> * at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:205)*
>>> * at org.apache.zeppelin.interpreter.ManagedInterpreterGroup.getOrCreateInterpreterProcess(ManagedInterpreterGroup.java:64)*
>>> * at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getOrCreateInterpreterProcess(RemoteInterpreter.java:111)*
>>> * at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:164)*
>>> * at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:132)*
>>> * at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:299)*
>>> * at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:407)*
>>> * at org.apache.zeppelin.scheduler.Job.run(Job.java:188)*
>>> * at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:315)*
>>> * at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)*
>>> * at java.util.concurrent.FutureTask.run(FutureTask.java:266)*
>>> * at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)*
>>> * at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)*
>>> * at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)*
>>> * at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)*
>>> * at java.lang.Thread.run(Thread.java:748)*
>>>
>>> Any further submission to this interpreter gives null pointer exceptions
>>> because there is no interpreter process.
>>> It looks like the interpreter driver process, while being submitted to
>>> yarn, gets stuck in ACCEPTED state, so we are not able to connect
>>> to the remote interpreter process. This happens even when there are
>>> resources available on the yarn cluster.
>>> I've also tried increasing *zeppelin.interpreter.connect.timeout*, but
>>> that didn't help, since the application stays stuck in ACCEPTED state
>>> indefinitely, and there are no logs available either.
>>> It would be great if you could point me to something that can help. Please
>>> also let me know if any configuration files are required for debugging
>>> this.
>>>
>>>
>>> Thanks and Regards
>>>
>>>
>>> *Sarthak Sharma*
>>> DevOps Engineer, Media.Net <http://media.net/>
>>> +918002228376 | sarthak...@media.net
>>> <http://en-gb.facebook.com/people/Sarthak-Sharma/100006006014244>
>>> <http://in.linkedin.com/in/sarthaksharma96>
>>>
>>>
>>>
>
> --
> Best Regards
>
> Jeff Zhang
>
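
The behaviour Jeff describes in this thread (kill the yarn app if it is still in ACCEPTED state once the timeout is reached) could be approximated from outside Zeppelin with a small watchdog. The sketch below is only an illustration under stated assumptions, not how Zeppelin implements it: it polls the ResourceManager REST "application state" resource (`/ws/v1/cluster/apps/{appid}/state`), the ResourceManager base URL is supplied by the caller, and `wait_or_kill`, `fetch_state` and `kill_app` are hypothetical helper names. Depending on the cluster's security configuration, the kill request may additionally require authentication.

```python
import json
import time
import urllib.request
from typing import Callable

# States in which the application has not started running yet.
PENDING_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED"}


def rm_state_url(rm_base: str, app_id: str) -> str:
    """URL of the ResourceManager REST resource holding the app's state."""
    return f"{rm_base.rstrip('/')}/ws/v1/cluster/apps/{app_id}/state"


def fetch_state(rm_base: str, app_id: str) -> str:
    """GET the current state, e.g. 'ACCEPTED' or 'RUNNING'."""
    with urllib.request.urlopen(rm_state_url(rm_base, app_id), timeout=10) as resp:
        return json.load(resp)["state"]


def kill_app(rm_base: str, app_id: str) -> None:
    """Ask the ResourceManager to kill the app by PUTting state KILLED."""
    req = urllib.request.Request(
        rm_state_url(rm_base, app_id),
        data=json.dumps({"state": "KILLED"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req, timeout=10).close()


def wait_or_kill(
    get_state: Callable[[], str],
    kill: Callable[[], None],
    timeout_s: float,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the app leaves the pending states; kill it after timeout_s.

    Returns the first non-pending state seen, or 'KILLED' if the timeout
    expired and the kill callback was invoked.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state not in PENDING_STATES:
            return state
        if waited >= timeout_s:
            kill()
            return "KILLED"
        sleep(poll_s)
        waited += poll_s
```

A call might look like `wait_or_kill(lambda: fetch_state(rm, app_id), lambda: kill_app(rm, app_id), timeout_s=60)`, where `rm` is the ResourceManager address and the timeout is chosen to match the interpreter connect timeout. The state and kill callbacks are passed in so that the polling logic can be exercised without a cluster.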