This is a limitation of the native PySparkInterpreter. Two solutions for you:

1. Use per-user scoped mode, so that each user owns his own Python process.
2. Use the IPySparkInterpreter of Zeppelin 0.8, which integrates Python with Zeppelin better.
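For (1), the setting lives on the Interpreter page: edit the spark interpreter and bind it as "Per User" with instance mode "scoped" (names as they appear in the 0.8 settings UI), so each user gets a separate interpreter and Python process. For (2), a minimal paragraph sketch, assuming the default %spark interpreter group of 0.8 where the IPython-backed interpreter is bound as ipyspark (the exact magic may differ in your binding):

    %spark.ipyspark

    # Runs in the IPython-based PySpark interpreter rather than the
    # native PySparkInterpreter that this thread's SIGINT issue concerns.
    df = spark.range(100)
    df.count()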
Jhon Anderson Cardenas Diaz <jhonderson2...@gmail.com> wrote on Wed, Jun 13, 2018 at 6:15 AM:

> Hi!
>
> We found the reason why this error is happening. It seems to be related
> to the fix
> <https://github.com/apache/zeppelin/commit/9f22db91c279b7daf6a13b2d805a874074b070fd>
> for the task ZEPPELIN-2075
> <https://issues.apache.org/jira/browse/ZEPPELIN-2075>.
>
> Because of that fix, when one particular user cancels his PySpark job,
> the PySpark jobs of *all the users are canceled!!*
>
> When a PySpark job is canceled, the method PySparkInterpreter.interrupt()
> is invoked, which sends SIGINT to the Python process; the SIGINT handler
> then cancels all the jobs in the shared SparkContext:
>
> context.py:
>
>     # create a signal handler which would be invoked on receiving SIGINT
>     def signal_handler(signal, frame):
>         self.cancelAllJobs()
>         raise KeyboardInterrupt()
>
> Is this a Zeppelin bug?
>
> Thank you.
>
> 2018-06-12 9:26 GMT-05:00 Jhon Anderson Cardenas Diaz <jhonderson2...@gmail.com>:
>
>> Hi!
>> I have the 0.8.0 version, from September 2017.
>>
>> 2018-06-12 4:48 GMT-05:00 Jianfeng (Jeff) Zhang <jzh...@hortonworks.com>:
>>
>>> Which version do you use?
>>>
>>> Best Regard,
>>> Jeff Zhang
>>>
>>> From: Jhon Anderson Cardenas Diaz <jhonderson2...@gmail.com>
>>> Reply-To: "us...@zeppelin.apache.org" <us...@zeppelin.apache.org>
>>> Date: Friday, June 8, 2018 at 11:08 PM
>>> To: "us...@zeppelin.apache.org" <us...@zeppelin.apache.org>,
>>>     "dev@zeppelin.apache.org" <dev@zeppelin.apache.org>
>>> Subject: All PySpark jobs are canceled when one user cancels his
>>>     PySpark paragraph (job)
>>>
>>> Dear community,
>>>
>>> Currently we are having problems with multiple users running
>>> paragraphs associated with PySpark jobs.
>>>
>>> The problem is that if a user aborts/cancels his PySpark paragraph
>>> (job), the active PySpark jobs of the other users are canceled too.
>>> Going into detail, I've seen that when you cancel a user's job, this
>>> method is invoked (which is fine):
>>>
>>>     sc.cancelJobGroup("zeppelin-[notebook-id]-[paragraph-id]")
>>>
>>> But somehow, unknown to me, this method is also invoked:
>>>
>>>     sc.cancelAllJobs()
>>>
>>> I deduce the above from the following trace, which appears in the
>>> jobs of the other users:
>>>
>>> Py4JJavaError: An error occurred while calling o885.count.
>>> : org.apache.spark.SparkException: Job 461 cancelled as part of cancellation of all jobs
>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>> at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1375)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:721)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:721)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:721)
>>> at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>>> at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:721)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1628)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
>>> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>> at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>>> at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>>> at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
>>> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>>> at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
>>> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
>>> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
>>> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2420)
>>> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2419)
>>> at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2801)
>>> at org.apache.spark.sql.Dataset.count(Dataset.scala:2419)
>>> at sun.reflect.GeneratedMethodAccessor120.invoke(Unknown Source)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>> at java.lang.Thread.run(Thread.java:748)
>>>
>>> (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error
>>> occurred while calling o885.count.\n', JavaObject id=o886),
>>> <traceback object at 0x7f9e669ae588>)
>>>
>>> Any idea why this could be happening?
>>>
>>> (I have the 0.8.0 version, from September 2017.)
>>>
>>> Thank you!
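For readers who want to see the difference outside Zeppelin, a minimal, self-contained sketch of the two cancellation paths contrasted above (assumptions: a local PySpark install; the job-group name is illustrative, mirroring the "zeppelin-[notebook-id]-[paragraph-id]" pattern from the thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("cancel-demo").getOrCreate()
    sc = spark.sparkContext

    # Zeppelin tags a paragraph's jobs with a job group and cancels only
    # that group -- this is the per-user path that works as intended:
    sc.setJobGroup("zeppelin-note1-para1", "user A's paragraph")
    sc.cancelJobGroup("zeppelin-note1-para1")  # affects only this group

    # The SIGINT handler in pyspark's context.py instead calls
    # cancelAllJobs(), which cancels every job in the shared SparkContext
    # regardless of job group -- which is why all users' jobs die together:
    sc.cancelAllJobs()

    spark.stop()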