That was exactly the issue. I moved it (hard coded) to 10 seconds, and now
all my interpreters start as expected with no issues.

So given this, perhaps a hard-coded 5 seconds isn't a good idea long term.
Some options:

1. Provide a conf variable that defaults to 5 seconds and can be set
globally to something else.
2. Set it per interpreter. Some interpreters may just need a little more
time.  This seems like more work, but it is also more flexible.
3. Check whether the port is listening before trying to connect.
Perhaps check after 5 seconds, then wait 5 more and check again. If it takes
longer than a timeout X (a variable in the config, with perhaps a
default of 30 seconds), then error out.
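
A rough sketch of how options 1 and 3 might fit together, in plain Java. The class, method, and property names below are made up for illustration and are not taken from the Zeppelin source. For option 2, the same lookup could be run against per-interpreter properties instead of a global config.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Properties;

/** Illustrative only: none of these names come from the Zeppelin code base. */
class InterpreterPortWait {

    // Hypothetical property name and default, chosen for this sketch.
    static final String TIMEOUT_KEY = "interpreter.connect.timeout.ms";
    static final long DEFAULT_TIMEOUT_MS = 30_000L;

    /** Reads the timeout from configuration, falling back to the default. */
    static long resolveTimeoutMs(Properties conf) {
        String raw = conf.getProperty(TIMEOUT_KEY);
        if (raw == null) {
            return DEFAULT_TIMEOUT_MS;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT_MS; // unparsable value: use the default
        }
    }

    /** Returns true if a TCP connection to host:port succeeds right now. */
    static boolean isListening(String host, int port, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused or timed out: not ready yet
        }
    }

    /** Polls until the port accepts connections or timeoutMs has elapsed. */
    static boolean waitForPort(String host, int port, long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (isListening(host, port, 1000)) {
                return true;
            }
            if (deadline - System.nanoTime() <= 0) {
                return false; // gave up: caller should report an error
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

With something like this, the server would call `waitForPort("localhost", port, resolveTimeoutMs(conf), 500)` after launching the interpreter process and only open the Thrift connection once it returns true. A fast interpreter is picked up right away, and a slow one gets up to the configured limit instead of a fixed 5 seconds.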

A side note: restarting the interpreter seems out of whack. You
would think that if the connection failed, I could restart the interpreter
and try again, but every time that happened, I had to restart Zeppelin
before I could even attempt it again.

Thanks for the pointer, and glad I could find something here.  I'd be
interested in your thoughts on how to address this.

John




On Sat, Jun 20, 2015 at 4:51 PM, moon soo Lee <m...@apache.org> wrote:

> Thanks for the explanation.
> The Zeppelin server daemon creates a remote process and waits
> 5 seconds for the interpreter process port to become available.
> So if your interpreter process is not created and listening on its
> port within 5 seconds, you would get a connection refused error.
>
>
> https://github.com/apache/incubator-zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterProcess.java#L116
>
> This is the related source code. I think you can try increasing the number
> from 5*1000 to something bigger, and see how it works.
>
> Thanks,
> moon
>
>
>
> On Sat, Jun 20, 2015 at 7:37 AM John Omernik <j...@omernik.com> wrote:
>
>> Thanks for the email, Moon. I have gone through some pretty logical
>> troubleshooting steps, but I can't seem to get this bug to occur
>> consistently. Like I said, this is an interesting setup in that sometimes
>> things work normally and sometimes they don't.
>>
>> When they don't start, and I check the interpreter logs, they say they
>> are starting fine, say on port xyz. When I check xyz in netstat (this is
>> all after the error), I see it listening properly, and I even see a
>> connection from localhost to it, but in the interface I can't run any more
>> paragraphs with that interpreter, even if I refresh the whole page.
>>
>> One thought I had, and maybe you could help me on this... what is the
>> process/timeout for connecting to a new interpreter?  I.e.
>>
>> Step 1:  Paragraph with interpreter that is not running is executed,
>> Zeppelin sees it not running and it kicks off the new JVM with the
>> interpreter
>> Step 2: Interpreter starts
>> Step 3: Zeppelin connects to the Interpreter
>>
>> I guess what I'm asking is: what is the process to go from Step 2 to
>> Step 3? Is there a delay in connecting? I.e., say the interpreter is
>> starting, and Zeppelin waits 2 seconds after it starts the interpreter and
>> then tries to connect.  If the interpreter isn't quite ready, does it throw
>> an error? Does it retry?  Does it wait until the interpreter is 100%
>> started before trying to connect?
>>
>> Given the inconsistency, I was thinking timing may be an issue.  These
>> servers have quite a bit going on, so perhaps my interpreter is taking
>> longer to start than Zeppelin expects?
>>
>>
>>
>> On Fri, Jun 19, 2015 at 12:49 PM, moon soo Lee <m...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> Thanks for sharing the problem.
>>>
>>> Zeppelin runs each interpreter instance as a separate JVM process and
>>> communicates with it through Thrift. In a little more detail, the Zeppelin
>>> server daemon launches the interpreter JVM process with a specific port,
>>> and the server daemon then connects to that port. Your error means that
>>> the Zeppelin server cannot connect to the interpreter JVM process. Do you
>>> see any way this process could cause a problem on your system?
>>>
>>> As for the same variable name in the markdown and hive interpreters, it
>>> won't be a problem.
>>>
>>> Thanks,
>>> moon
>>>
>>>
>>>
>>> On Fri, Jun 19, 2015 at 9:34 AM John Omernik <j...@omernik.com> wrote:
>>>
>>>> Another thing that may or may not be related: the server running
>>>> Zeppelin has multiple interfaces. It "appears" the interpreter binds on
>>>> all interfaces, but what about the connection? Does that come from a
>>>> specific interface? Could that be causing the connection refused? (I have
>>>> two eth interfaces and a docker0 interface on this node.)
>>>>
>>>> John
>>>>
>>>>
>>>> On Fri, Jun 19, 2015 at 8:02 AM, John Omernik <j...@omernik.com> wrote:
>>>>
>>>>> I am not an expert in Java, but could there be an issue using the
>>>>> markdown and the hive interpreters together because they share a variable
>>>>> name (md = markdown object in %markdown and md = metadata in %hive)?
>>>>>
>>>>>
>>>>>
>>>>> markdown:
>>>>>
>>>>> public void open() { md = new Markdown4jProcessor(); }
>>>>>
>>>>> hive:
>>>>>
>>>>> try {
>>>>>   ResultSetMetaData md = res.getMetaData();
>>>>>   for (int i = 1; i < md.getColumnCount() + 1; i++) {
>>>>>     if (i == 1) { msg.append(md.getColumnName(i)); }
>>>>>     else { msg.append("\t" + md.getColumnName(i)); }
>>>>>   }
>>>>>
>>>>> On Fri, Jun 19, 2015 at 6:56 AM, John Omernik <j...@omernik.com>
>>>>> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I am working with three primary interpreters: %md, %pyspark, and
>>>>>> %hive.  What I am noticing is that with my current config, sometimes an
>>>>>> interpreter will start, and other times I'll get the errors below. I
>>>>>> wish I could say what the rhyme or reason was.
>>>>>>
>>>>>> If I get the errors, then I have to restart Zeppelin before it will
>>>>>> work (or even attempt to work). I've tried clicking "restart interpreter"
>>>>>> in the interpreters tab, and it seems to work, but when I go back to a
>>>>>> notebook I get "Scheduler already terminated".
>>>>>>
>>>>>> What's interesting here is that, other than a restart, I can run the
>>>>>> cells (I have three, one for each interpreter) in different orders and
>>>>>> get different results. Sometimes if I run %hive first, it works, then
>>>>>> %pyspark will work too, then %md will fail. (Note these are the SAME
>>>>>> commands, on the same server, same config, etc.)
>>>>>>
>>>>>> Other times, I can get them to run no matter what. It's very
>>>>>> inconsistent, and once an interpreter fails, there is no getting it
>>>>>> back until the whole server is restarted.
>>>>>>
>>>>>> Also of note here: I am running a recently compiled version of this
>>>>>> (I downloaded it on Wed using git clone).
>>>>>>
>>>>>> Any help would be appreciated in determining how to troubleshoot this!
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Example from %md
>>>>>>
>>>>>> *In Notebook error*
>>>>>>
>>>>>>
>>>>>>
>>>>>> %md
>>>>>> #For the Love of Jeezy Pete
>>>>>>
>>>>>> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>> org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>> org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>> org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> *In Running Shell Window (where I ran bin/zeppelin.sh)*
>>>>>>
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>>     at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>>     at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>>     at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:133)
>>>>>>     ... 12 more
>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>>     ... 19 more
>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>>     at java.net.Socket.connect(Socket.java:579)
>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>>     ... 20 more
>>>>>>
>>>>>> *from interpreter log file:*
>>>>>>
>>>>>>  INFO [2015-06-19 06:44:29,134] ({Thread-0} RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on port 54930
>>>>>>
>>>>>>
>>>>>> *From Zeppelin Log file:*
>>>>>>
>>>>>>  INFO [2015-06-19 06:44:19,329] ({pool-1-thread-2} SchedulerFactory.java[jobStarted]:132) - Job paragraph_1434713440246_1991176208 started by scheduler remoteinterpreter_328619575
>>>>>>
>>>>>>  INFO [2015-06-19 06:44:19,331] ({pool-1-thread-2} Paragraph.java[jobRun]:194) - run paragraph 20150619-063040_649381067 using md org.apache.zeppelin.interpreter.LazyOpenInterpreter@38946f29
>>>>>>
>>>>>>  INFO [2015-06-19 06:44:19,341] ({pool-1-thread-2} RemoteInterpreterProcess.java[reference]:107) - Run interpreter process /mapr/brewpot/mesos/zeppelin/0.5.0-incubating-SNAPSHOT/bin/interpreter.sh -d /mapr/brewpot/mesos/zeppelin/0.5.0-incubating-SNAPSHOT/interpreter/md -p 54930
>>>>>>
>>>>>> ERROR [2015-06-19 06:44:24,399] ({Thread-35} RemoteScheduler.java[getStatus]:226) - Can't get status information
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>>     at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:224)
>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.run(RemoteScheduler.java:183)
>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>>     ... 8 more
>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>>     at java.net.Socket.connect(Socket.java:579)
>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>>     ... 9 more
>>>>>>
>>>>>> ERROR [2015-06-19 06:44:24,399] ({pool-1-thread-2} Job.java[run]:183) - Job failed
>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>>     at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>>     at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>>     at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:133)
>>>>>>     ... 12 more
>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>>     ... 19 more
>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>>     at java.net.Socket.connect(Socket.java:579)
>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>>     ... 20 more
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
