I opened a JIRA with all the details (logs etc): https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZEPPELIN-2515
Thank you
Pietro Pugni

On 9 May 2017 7:48 PM, "Jongyoul Lee" <jongy...@gmail.com> wrote:

Hi,

Thanks for this detailed debugging. First, NotebookServer doesn't offer any clue about this symptom, because it is only used between the browser and the Zeppelin server. I don't know why R stopped unexpectedly. Is there any log related to R? I'm not actually familiar with R. BTW, I'll install R and test it locally.

On Tue, May 9, 2017 at 8:29 AM, Pietro Pugni <pietro.pu...@gmail.com> wrote:

> I repost this because it didn’t appear on the mailing list board.
>
> These are the steps needed to reproduce the error and to track down the log message.
>
> 1) I started a brand new instance of Zeppelin by issuing:
> service zeppelin start
>
> and started a bash script that tracks R process activity.
> After running a simple R script from Zeppelin, the R interpreter process was started:
>
> Mon May 8 11:27:59 CEST 2017 >>> R started
>
> 2) I left the browser open and at 12:26:15 I closed the browser. Zeppelin logged the connection being closed:
> INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null
>
> 3) At 13:08:00 R was closed.
> My script returned:
> Mon May 8 13:08:00 CEST 2017 >>> R stopped
>
> This is the output from the interpreter log file (non-useful lines removed):
> INFO [2017-05-08 11:27:43,632] ({Thread-0} RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on port 45227
> INFO [2017-05-08 11:27:44,600] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkInterpreter
> INFO [2017-05-08 11:27:44,624] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkSqlInterpreter
> INFO [2017-05-08 11:27:44,629] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.DepInterpreter
> INFO [2017-05-08 11:27:44,640] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.PySparkInterpreter
> INFO [2017-05-08 11:27:44,643] ({pool-1-thread-3} RemoteInterpreterServer.java[createInterpreter]:190) - Instantiate interpreter org.apache.zeppelin.spark.SparkRInterpreter
> ...
> INFO [2017-05-08 11:28:00,188] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1494235664723 finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter2097894179
> DEBUG [2017-05-08 11:28:00,819] ({pool-1-thread-3} RemoteInterpreterServer.java[resourcePoolGetAll]:911) - Request getAll from ZeppelinServer
> *DEBUG [2017-05-08 13:08:00,187] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Error in handleErrors(returnStatus, conn) : *
> *DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output: No status is returned. Java SparkR backend might have failed.*
> *DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Calls: <Anonymous> -> invokeJava -> handleErrors*
> *DEBUG [2017-05-08 13:08:00,188] ({Exec Stream Pumper} InterpreterOutputStream.java[processLine]:72) - Interpreter output:Execution halted*
>
> This is the output from the Zeppelin log file (it didn't record the R interpreter failure):
> INFO [2017-05-08 11:28:00,221] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:2056) - Job 20170506-145151_1585482989 is finished successfully, status: FINISHED
> INFO [2017-05-08 11:28:00,675] ({pool-2-thread-2} SchedulerFactory.java[jobFinished]:137) - Job paragraph_1494075111996_-1250116940 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session2130846287
> *INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null*
> INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:271) - Validating all active sessions...
> INFO [2017-05-08 12:27:12,126] ({Thread-33} AbstractValidatingSessionManager.java[validateSessions]:304) - Finished session validation. No sessions were stopped.
>
> Hope this helps.
> Any hints?
>
> On 8 May 2017, at 11:08, Pietro Pugni <pietro.pu...@gmail.com> wrote:
>
> I know for sure that the R process gets killed (or quits), but I don't know whether its parent process (interpreter.sh) gets killed too.
>
> I noticed that I can always restart the interpreter on 0.7.1, while on 0.7.0 it was sometimes impossible (I had to restart the Zeppelin service manually). That JIRA probably improved the situation a little.
>
> I'm now running a bash script that tracks the start and stop times of the R process in order to shed some light on this issue. I enabled DEBUG logging in the log4j properties file.
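The bash script that produced the "R started" / "R stopped" lines above was not posted. As an illustration only, here is a minimal Python sketch of the same idea. It assumes the interpreter appears in the process table under the exact name R and that pgrep is installed; the function names (is_running, transitions, poll, watch) are hypothetical and not part of Zeppelin.

```python
import subprocess
import time


def is_running(process_name):
    """Return True if pgrep finds a process with exactly this name."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def transitions(samples, label="R"):
    """Yield one message each time the sampled state flips.

    `samples` is an iterable of (timestamp, running) pairs.
    """
    previous = False  # assumption: the process is not running before the first sample
    for stamp, running in samples:
        if running != previous:
            yield f"{stamp} >>> {label} {'started' if running else 'stopped'}"
            previous = running


def poll(process_name, interval_seconds):
    """Endlessly sample the process table, yielding (timestamp, running) pairs."""
    while True:
        yield time.strftime("%a %b %d %H:%M:%S %Z %Y"), is_running(process_name)
        time.sleep(interval_seconds)


def watch(process_name="R", interval_seconds=1.0):
    """Print a line whenever the process starts or stops (runs until interrupted)."""
    for message in transitions(poll(process_name, interval_seconds), label=process_name):
        print(message, flush=True)
```

Calling watch() in a terminal on the Zeppelin host prints one line per start or stop, which can then be lined up against the timestamps in the Zeppelin and interpreter logs.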
>
> On 6 May 2017 4:43 PM, "Paul Brenner" <pbren...@placeiq.com> wrote:
>
>> Great work documenting repeatable steps for this hard-to-nail-down problem. I see similar problems running the spark (scala) interpreter but haven’t been as systematic about hunting down the issue as you.
>>
>> I do wonder if this is related somehow to https://issues.apache.org/jira/browse/ZEPPELIN-1832, which just seems to have addressed killing off zombie processes, but I’m not sure it covered where the zombie processes are coming from. Perhaps we need to open a ticket for this?
>>
>> In the meantime, if you don’t have the ability to restart Zeppelin every time you run into this problem, you can probably just kill the interpreter process. I find myself doing that multiple times in a normal work day.
>>
>> Paul Brenner
>> DATA SCIENTIST
>>
>> On Sat, May 06, 2017 at 6:47 AM Pietro Pugni <pietro.pu...@gmail.com> wrote:
>>
>>> Hi all,
>>> I am facing a strange issue on two different machines that act as servers. Each of them runs an instance of Zeppelin installed as a systemd service.
>>> The configuration is:
>>> - Ubuntu Server 16.04.2 LTS
>>> - Spark 2.1.0
>>> - Microsoft Open R 3.3.2
>>> - Zeppelin 0.7.1 (0.7.0 gave the same problems)
>>>
>>> zeppelin-env.sh has the following settings:
>>> export SPARK_HOME="/spark/home/directory"
>>>
>>> spark-env.sh has the following settings:
>>> export LANG="en_US"
>>> export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/some/dir -Dspark.eventLog.dir=/some/dir/spark-events -Dhadoop.tmp.dir=/some/dir"
>>> export _JAVA_OPTIONS+=" -Djava.io.tmpdir=/some/dir"
>>>
>>> spark-defaults.conf is set as:
>>> spark.executor.memory 21g
>>> spark.driver.memory 21g
>>> spark.python.worker.memory 4g
>>> spark.sql.autoBroadcastJoinThreshold 0
>>>
>>> I use Spark in stand-alone mode and it works perfectly. It also works correctly with Zeppelin, but this is what happens:
>>> 1) Start Zeppelin on the server using the command service zeppelin start
>>> 2) Connect to port 8080 using Mozilla Firefox from the client
>>> 3) Insert username and password (I enabled Shiro authentication)
>>> 4) Open a notebook
>>> 5) Execute the following code:
>>> %spark.r
>>> 2+2
>>> 6) The code runs correctly and I can see that R is currently running as a process.
>>> 7) Repeat steps 2-5 after some time (let’s say 2 or 3 hours) and Zeppelin remains forever on “Running” or, if more time has elapsed since the last run (for example 1 day), it returns “Error”. The “time-to-be-unresponsive” seems to be random and unpredictable.
>>> Also, R is not present in the list of running processes. The Spark session remains active because I can access the Spark UI on port 4040 and the application name is “Zeppelin”, so it’s the Spark instance created by Zeppelin.
>>>
>>> I observed that sometimes I can simply restart the interpreter from the Zeppelin UI, but many other times it doesn’t work and I have to restart Zeppelin ( service zeppelin restart ).
>>>
>>> This issue affects both 0.7.0 and 0.7.1, but I haven’t tried previous versions. It also happens if Zeppelin isn’t installed as a service.
>>>
>>> I can’t provide more detail because I can’t see any error or warning in the logs. This is really strange.
>>>
>>> Thank you all.
>>> Kind regards
>>> Pietro Pugni

--
이종열, Jongyoul Lee, 李宗烈
http://madeng.net
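The log excerpts quoted in this thread carry enough timing information to measure how long R sat idle before it failed. The following is a minimal Python sketch, assuming the `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp layout seen in those excerpts. parse_stamp and elapsed are hypothetical helper names, and the three sample lines are copied from the logs quoted in the thread (the last one shortened).

```python
import re
from datetime import datetime

# Matches the bracketed timestamp at the start of a Zeppelin log line.
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def parse_stamp(line):
    """Return the timestamp of a log line as a datetime, or None if absent."""
    match = STAMP.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def elapsed(earlier_line, later_line):
    """Return the time between two log lines as a timedelta."""
    return parse_stamp(later_line) - parse_stamp(earlier_line)


last_job = (
    "INFO [2017-05-08 11:28:00,188] ({pool-2-thread-2} "
    "SchedulerFactory.java[jobFinished]:137) - Job remoteInterpretJob_1494235664723 "
    "finished by scheduler org.apache.zeppelin.spark.SparkRInterpreter2097894179"
)
browser_closed = (
    "INFO [2017-05-08 12:26:15,879] ({qtp423031029-60} "
    "NotebookServer.java[onClose]:363) - Closed connection to 127.0.0.1 : 33798. (1001) null"
)
r_failed = (
    "DEBUG [2017-05-08 13:08:00,187] ({Exec Stream Pumper} "
    "InterpreterOutputStream.java[processLine]:72) - Interpreter output:Error in handleErrors"
)

print(elapsed(last_job, r_failed))        # 1:39:59.999000
print(elapsed(browser_closed, r_failed))  # 0:41:44.308000
```

On these three lines, the R failure is logged 1:39:59.999 after the last finished SparkR job and 0:41:44.308 after the browser connection was closed.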