[ https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383437#comment-15383437 ]
Kevin Zhang commented on SPARK-13514:
-------------------------------------

Yes, file:// does exist, and removing it is definitely a workaround. But what I mean is: is there another way to solve the problem besides removing the file:// prefix? In our environment the prefix is needed for a reason.

> Spark Shuffle Service 1.6.0 issue in Yarn
> ------------------------------------------
>
>                 Key: SPARK-13514
>                 URL: https://issues.apache.org/jira/browse/SPARK-13514
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Satish Kolli
>
> Spark shuffle service 1.6.0 in Yarn fails with an unknown exception. When I
> replace the spark shuffle jar with the version 1.5.2 jar file, the following
> succeeds without any issues.
>
> Hadoop Version: 2.5.1 (Kerberos Enabled)
> Spark Version: 1.6.0
> Java Version: 1.7.0_79
>
> {code}
> $SPARK_HOME/bin/spark-shell \
>   --master yarn \
>   --deploy-mode client \
>   --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.dynamicAllocation.minExecutors=5 \
>   --conf spark.yarn.executor.memoryOverhead=2048 \
>   --conf spark.shuffle.service.enabled=true \
>   --conf spark.scheduler.mode=FAIR \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --executor-memory 6G \
>   --driver-memory 8G
> {code}
>
> {code}
> scala> val df = sc.parallelize(1 to 50).toDF
> df: org.apache.spark.sql.DataFrame = [_1: int]
>
> scala> df.show(50)
> {code}
>
> {code}
> 16/02/26 08:20:53 INFO spark.SparkContext: Starting job: show at <console>:30
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Got job 0 (show at <console>:30) with 1 output partitions
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (show at <console>:30)
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at show at <console>:30), which has no missing parents
> 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.2 KB, free 2.2 KB)
> 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1411.0 B, free 3.6 KB)
> 16/02/26 08:20:53 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.5.76.106:46683 (size: 1411.0 B, free: 5.5 GB)
> 16/02/26 08:20:53 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at show at <console>:30)
> 16/02/26 08:20:53 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
> 16/02/26 08:20:53 INFO scheduler.FairSchedulableBuilder: Added task set TaskSet_0 tasks to pool default
> 16/02/26 08:20:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, XXXXXXXXXXXXXXXXXXXXXXXX, partition 0,PROCESS_LOCAL, 2031 bytes)
> 16/02/26 08:20:53 INFO cluster.YarnClientSchedulerBackend: Disabling executor 2.
> 16/02/26 08:20:54 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 0)
> 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
> 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, XXXXXXXXXXXXXXXXXXXXXXXX, 48113)
> 16/02/26 08:20:54 INFO storage.BlockManagerMaster: Removed 2 successfully in removeExecutor
> 16/02/26 08:20:54 ERROR cluster.YarnScheduler: Lost executor 2 on XXXXXXXXXXXXXXXXXXXXXXXX: Container marked as failed: container_1456492687549_0001_01_000003 on host: XXXXXXXXXXXXXXXXXXXXXXXX. Exit status: 1.
> Diagnostics: Exception from container-launch:
> ExitCodeException exitCode=1:
> ExitCodeException exitCode=1:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> {code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
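For anyone hitting the same failure: my reading of the workaround discussed above is that the file:// prefix sits in the NodeManager local-directories setting, which Spark 1.6.0's external shuffle service appears to consume as plain filesystem paths. This is an assumption from the thread, not confirmed in this comment, and the directory paths below are made-up examples. A sketch of the change in yarn-site.xml:

{code:xml}
<!-- yarn-site.xml (sketch, assuming the property involved is
     yarn.nodemanager.local-dirs and the paths are examples).
     Listing bare paths instead of file:// URIs is the workaround
     referred to above: -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <!-- instead of: file:///data1/yarn/local,file:///data2/yarn/local -->
  <value>/data1/yarn/local,/data2/yarn/local</value>
</property>
{code}

After editing, the NodeManagers would need a restart for the shuffle service to pick up the change.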