[ 
https://issues.apache.org/jira/browse/KUDU-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076589#comment-17076589
 ] 

ASF subversion and git services commented on KUDU-3099:
-------------------------------------------------------

Commit 5432d316a1c0e012b247b0c83ce86a481a4597dc in kudu's branch 
refs/heads/master from waleed
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=5432d31 ]

KUDU-3099: Remove System.exit() calls from KuduBackup/KuduRestore

The System.exit() calls have a side effect that can cause Spark to fail even
when the run function completes successfully. Rather than call System.exit(),
the run() method now returns true on a successful run. main() then throws a
RuntimeException if run() failed; otherwise it calls SparkSession's stop()
method to shut down Spark cleanly. Unfortunately the issue isn't easy to
reproduce, but we had one environment exhibiting the problem and confirmed
that this patch fixes the issue. TestKuduBackup.scala was modified to use
assertFalse() to check for failure and assertTrue() for success.

Change-Id: I7d1b4796b6280adecd7dab685a0281af6b2570ce
Reviewed-on: http://gerrit.cloudera.org:8080/15638
Tested-by: Grant Henke <[email protected]>
Reviewed-by: Grant Henke <[email protected]>
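
The return-instead-of-exit pattern the commit message describes can be
sketched in a few lines. This is a minimal illustration, not the actual
KuduBackup code: the object and helper names are invented, and SparkSession
is stubbed so the sketch is self-contained.

```scala
// Hypothetical sketch of the fix described above: run() reports success via
// its return value, and main() decides how the JVM terminates.
object BackupToolSketch {
  // Stand-in for SparkSession; the real code uses an actual Spark session.
  final class FakeSparkSession {
    var stopped = false
    def stop(): Unit = stopped = true
  }

  // The real run() drives the Spark backup job; here it just echoes its input.
  def run(succeeded: Boolean): Boolean = succeeded

  // Mirrors the fixed main(): no System.exit(), so Spark's shutdown hook
  // never fires before the final application status is reported.
  def finish(session: FakeSparkSession, ok: Boolean): Unit =
    if (ok) session.stop()                                // clean shutdown
    else throw new RuntimeException("Kudu backup failed") // fails the driver normally
}
```

Throwing from main() lets Spark record the failure through its normal error
path instead of racing the shutdown hook that System.exit() would trigger.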


> KuduBackup/KuduRestore System.exit(0) results in Spark on YARN failure with 
> exitCode: 16
> ----------------------------------------------------------------------------------------
>
>                 Key: KUDU-3099
>                 URL: https://issues.apache.org/jira/browse/KUDU-3099
>             Project: Kudu
>          Issue Type: Bug
>          Components: backup, spark
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Waleed Fateem
>            Assignee: Waleed Fateem
>            Priority: Major
>
> When running KuduBackup/KuduRestore the underlying Spark application can fail 
> when running on YARN even when the backup/restore tasks complete 
> successfully. The following was from the Spark driver log:
> {code:java}
> INFO spark.SparkContext: Submitted application: Kudu Table Backup
> ..
> INFO spark.SparkContext: Starting job: save at KuduBackup.scala:90
> INFO scheduler.DAGScheduler: Got job 0 (save at KuduBackup.scala:90) with 200 
> output partitions
> scheduler.DAGScheduler: Final stage: ResultStage 0 (save at 
> KuduBackup.scala:90)
> ..
> INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 
> (MapPartitionsRDD[2] at save at KuduBackup.scala:90) (first 15 tasks are for 
> partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
> INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
> ..
> INFO cluster.YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all 
> completed, from pool 
> INFO scheduler.DAGScheduler: Job 0 finished: save at KuduBackup.scala:90, 
> took 20.007488 s
> ..
> INFO spark.SparkContext: Invoking stop() from shutdown hook
> ..
> INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
> ..
> INFO spark.SparkContext: Successfully stopped SparkContext
> INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: 
> Shutdown hook called before final status was reported.)
> INFO util.ShutdownHookManager: Shutdown hook called{code}
> Spark explicitly added this shutdown hook to catch System.exit() calls; if 
> one occurs before the SparkContext stops, the application status is 
> considered a failure:
> [https://github.com/apache/spark/blob/branch-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L299]
> The System.exit() call added as part of KUDU-2787, which was merged into 
> the 1.10.x and 1.11.x branches, can cause this race condition. 
>  
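
The race described in the quoted report can be modeled in a few lines. This
is a toy model with invented names, not Spark's actual ApplicationMaster
code: System.exit() in user code runs JVM shutdown hooks immediately, so the
hook can observe that no final status was reported yet and mark the
application FAILED.

```scala
// Toy model (names invented) of the shutdown-hook check described above.
object ShutdownHookModel {
  val ExitEarlyHook = 16 // exit code Spark reports for this case, per the log above

  // What the hook decides based on whether the final status beat it.
  def finalAppStatus(statusReportedBeforeHook: Boolean): (String, Int) =
    if (statusReportedBeforeHook) ("SUCCEEDED", 0)
    else ("FAILED", ExitEarlyHook) // "Shutdown hook called before final status was reported"
}
```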



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
