That was exactly the issue. I moved it (hard coded) to 10 seconds, and now all my interpreters start as expected with no issues.
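To make the timing issue concrete, here is a rough sketch of polling the interpreter port until a deadline instead of waiting one fixed amount of time. This is not Zeppelin's code: `PortWait`, `waitForPort`, the 30 second overall timeout, and the 500 ms poll interval are all names and values I made up for illustration, and it assumes that a plain TCP connect succeeding is a good enough sign that the interpreter is ready.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, not part of Zeppelin.
final class PortWait {

    /**
     * Tries to open a TCP connection to host:port every pollMs milliseconds.
     * Returns true as soon as one attempt succeeds, or false once timeoutMs
     * has elapsed without a successful connect.
     */
    static boolean waitForPort(String host, int port, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            try (Socket socket = new Socket()) {
                // Each attempt is capped at 1 second, so the total wait can
                // overshoot the deadline by at most that much.
                socket.connect(new InetSocketAddress(host, port), 1000);
                return true;
            } catch (IOException notListeningYet) {
                // Connection refused or timed out: fall through and retry.
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(pollMs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an interpreter process: a local listener on a free port.
        try (ServerSocket listener = new ServerSocket(0)) {
            boolean up = waitForPort("127.0.0.1", listener.getLocalPort(), 30_000, 500);
            System.out.println("listening: " + up);
        }
    }
}
```

The overall timeout is the value that would come from a config setting, either global or per interpreter; the poll interval only controls how quickly a successful start is noticed.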
So given this, a hard-coded 5 seconds probably isn't a good idea long term. Some options:

1. Provide a conf variable that defaults to 5 seconds and can be set globally to something else.
2. Set it per interpreter. Some interpreters may just need a little more time. This seems like more work, but it is also more flexible.
3. Check whether the port is listening before trying to connect. Perhaps check after 5 seconds, then wait 5 more and check again. If it takes longer than a timeout X (a variable in the config, with perhaps a default of 30), then error out.

A side note: restarting the interpreter seems out of whack. You would think that if the connection failed, I could restart the interpreter and try again, but every time that happened I had to restart Zeppelin before I could even attempt it again.

Thanks for the pointer; glad I could find something here. I'd be interested in your thoughts on how to address this.

John

On Sat, Jun 20, 2015 at 4:51 PM, moon soo Lee <m...@apache.org> wrote:

> Thanks for explanation.
> Zeppelin server daemon is creating a remote process and wait's for
> interpreter process port being available for 5 seconds.
> So, there is possibility that if your interpreter process is not created
> and listening port in 5 seconds, It would have connection refused error.
>
> https://github.com/apache/incubator-zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterProcess.java#L116
>
> This is related source code. I think you can try increase the number from
> 5*1000 to something bigger, and see how it works.
>
> Thanks,
> moon
>
> On Sat, Jun 20, 2015 at 7:37 AM John Omernik <j...@omernik.com> wrote:
>
>> Thanks for the email Moon, I have gone through some pretty logical
>> troubleshooting steps, but I can't seem to get this bug to occur
>> consistently.
>> Like I said, this is an interesting setup in that sometimes
>> things work normally sometimes they don't
>>
>> When they don't start, and I check the interpreter logs, they say they
>> are starting fine, say on port xyz, when I check xyz (this is all after the
>> error) in netstat, I see it listening properly, and I even see a connection
>> from localhost to it, but in the interface, I can't run any more paragraphs
>> with that interpreter. Even if I refresh the whole page.
>>
>> One thought I had, and maybe you could help me on this... what is the
>> process/time out to connect to a new interpreter? I.e.
>>
>> Step 1: Paragraph with interpreter that is not running is executed,
>> Zeppelin sees it not running and it kicks off the new JVM with the
>> interpreter
>> Step 2: Interpreter starts
>> Step 3: Zeppelin connects to the Interpreter
>>
>> I guess what is the process to go from Step 2 to Step3? Is there a delay
>> in connection? Is there a retry? I.e. If the interpreter is starting, and
>> lets set Zeppelin take 2 seconds after it starts the interpreter and tries
>> to connect. If the interpreter isn't quite ready does it throw an error?
>> Does it retry? Does it wait until the interpreter is 100% started before
>> trying to connect? Is there a retry?
>>
>> Given the inconsistency, I was thinking timing may be an issue. These
>> are servers that have quite a bit going on them, thus perhaps my
>> interpreter starting is taking longer than Zeppelin would expect?
>>
>> On Fri, Jun 19, 2015 at 12:49 PM, moon soo Lee <m...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> Thanks for sharing the problem.
>>>
>>> Zeppelin runs each interpreter instance as a separate JVM process and
>>> communicate through thrift. Little detail is, Zeppelin server daemon invoke
>>> interpreter JVM process with specific port and server daemon connect to
>>> that port. Your error is that Zeppelin server can not connect to the
>>> interpreter JVM process. Do you see any possibility that this process can
>>> cause problem on your system?
>>>
>>> About the same variable name in markdown and hive interpreter, it won't
>>> be a problem.
>>>
>>> Thanks,
>>> moon
>>>
>>> On Fri, Jun 19, 2015 at 9:34 AM John Omernik <j...@omernik.com> wrote:
>>>
>>>> Another thing that may or may not be related is on the server running
>>>> Zeppelin, I have multiple interfaces, it "appears" the interpreter binds on
>>>> all interfaces, but what about the connection? Does that come from a
>>>> specific interface? Could that be causing the connection refused? (I have
>>>> two eth interfaces and a docker0 interface on this node)
>>>>
>>>> John
>>>>
>>>> On Fri, Jun 19, 2015 at 8:02 AM, John Omernik <j...@omernik.com> wrote:
>>>>
>>>>> I am not an expert in Java, but could there be an issue using the
>>>>> markdown and the hive interpreters together because they share a variable
>>>>> name (md = markdown object in %markdown and md = metatdata in %hive)
>>>>>
>>>>> markdown:
>>>>>
>>>>> public void open() { md = new Markdown4jProcessor(); }
>>>>>
>>>>> hive:
>>>>>
>>>>> try {
>>>>>   ResultSetMetaData md = res.getMetaData();
>>>>>   for (int i = 1; i < md.getColumnCount() + 1; i++) {
>>>>>     if (i == 1) {
>>>>>       msg.append(md.getColumnName(i));
>>>>>     } else {
>>>>>       msg.append("\t" + md.getColumnName(i));
>>>>>     }
>>>>>   }
>>>>>
>>>>> On Fri, Jun 19, 2015 at 6:56 AM, John Omernik <j...@omernik.com>
>>>>> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I am working with three primary interpreters, %md, %pyspark, and
>>>>>> %hive. What I am noticing is with my current config, sometimes an
>>>>>> interpreter will start other times, I'll get an errors below. I wish I
>>>>>> could say what the rhyme or reason was.
>>>>>>
>>>>>> If I get the errors, then I have to restart Zeppelin before it will
>>>>>> work (or even attempt to work). I've tried clicking "restart interpreter"
>>>>>> in the interpreters tab, it seems to work, but when I go back to a
>>>>>> notebook I get "Scheduler already terminated"
>>>>>>
>>>>>> What's interesting here, is other than a restart, I can run the cells
>>>>>> (I have three one for each interpreter) in different orders and get
>>>>>> different results, sometimes if I run %hive first, it works, then
>>>>>> %pyspark, that will work too then %md will fail. (Note these are the
>>>>>> SAME commands, on the same server, same config etc).
>>>>>>
>>>>>> Other times, I can get them to run no matter what, it's very
>>>>>> inconsistent, and combined with the fact that once an interpreter fails,
>>>>>> there is no getting it back until the whole server is restarted.
>>>>>>
>>>>>> Also of note here: I am running a recently compiled version of this
>>>>>> (I downloaded this on Wed) using git clone)
>>>>>>
>>>>>> Any help would be appreciated in determining how to troubleshoot this!
>>>>>>
>>>>>> John
>>>>>>
>>>>>> Example from %md
>>>>>>
>>>>>> *In Notebook error*
>>>>>>
>>>>>> %md
>>>>>> #For the Love of Jeezy Pete
>>>>>>
>>>>>> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>> org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>> org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>> org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> *In Running Shell Window (where I ran bin/zeppelin.sh)*
>>>>>>
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>> at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>> at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>> at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>> at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>> at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:133)
>>>>>> ... 12 more
>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>> ... 19 more
>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>> at java.net.Socket.connect(Socket.java:579)
>>>>>> at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>> ... 20 more
>>>>>>
>>>>>> *from interpreter log file:*
>>>>>>
>>>>>> INFO [2015-06-19 06:44:29,134] ({Thread-0} RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on port 54930
>>>>>>
>>>>>> *From Zeppelin Log file:*
>>>>>>
>>>>>> INFO [2015-06-19 06:44:19,329] ({pool-1-thread-2} SchedulerFactory.java[jobStarted]:132) - Job paragraph_1434713440246_1991176208 started by scheduler remoteinterpreter_328619575
>>>>>>
>>>>>> INFO [2015-06-19 06:44:19,331] ({pool-1-thread-2} Paragraph.java[jobRun]:194) - run paragraph 20150619-063040_649381067 using md org.apache.zeppelin.interpreter.LazyOpenInterpreter@38946f29
>>>>>>
>>>>>> INFO [2015-06-19 06:44:19,341] ({pool-1-thread-2} RemoteInterpreterProcess.java[reference]:107) - Run interpreter process /mapr/brewpot/mesos/zeppelin/0.5.0-incubating-SNAPSHOT/bin/interpreter.sh -d /mapr/brewpot/mesos/zeppelin/0.5.0-incubating-SNAPSHOT/interpreter/md -p 54930
>>>>>>
>>>>>> ERROR [2015-06-19 06:44:24,399] ({Thread-35} RemoteScheduler.java[getStatus]:226) - Can't get status information
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>> at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>> at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:224)
>>>>>> at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.run(RemoteScheduler.java:183)
>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>> ... 8 more
>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>> at java.net.Socket.connect(Socket.java:579)
>>>>>> at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>> ... 9 more
>>>>>>
>>>>>> ERROR [2015-06-19 06:44:24,399] ({pool-1-thread-2} Job.java[run]:183) - Job failed
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>> at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>> at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>> at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>> at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>> at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>> at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>> at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:133)
>>>>>> ... 12 more
>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>> at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>> at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>> ... 19 more
>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>> at java.net.Socket.connect(Socket.java:579)
>>>>>> at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>> ... 20 more
>>>>>>
>>>>>
>>>>
>>