Looking at the jstack, it seems that it doesn't contain all the threads; I can't find the main thread in it.
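If it helps to narrow things down, one way to see exactly which threads are still keeping the driver JVM alive is to dump them from inside the application right before main returns, rather than relying on jstack alone. A minimal sketch in plain Scala (nothing Spark-specific; the object name is just for illustration):

    import scala.collection.JavaConverters._

    object ThreadReport {
      /** Prints every live JVM thread with its daemon flag and state.
        * Any non-daemon thread (other than "main") still alive after the
        * last job finishes can prevent the driver process from exiting. */
      def dumpLiveThreads(): Unit = {
        Thread.getAllStackTraces.keySet.asScala.toSeq
          .sortBy(_.getName)
          .foreach { t =>
            println(f"${t.getName}%-40s daemon=${t.isDaemon} state=${t.getState}")
          }
      }
    }

Calling ThreadReport.dumpLiveThreads() as the last line of the driver program and looking for daemon=false entries should show what is blocking the shutdown.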
I am not an expert on analyzing jstacks, but are you creating any threads in your code? Are you shutting them down correctly? This one is a non-daemon thread and doesn't seem to be coming from Spark:

*"Scheduler-2144644334"* #110 prio=5 os_prio=0 tid=0x00007f8104001800 nid=0x715 waiting on condition [0x00007f812cf95000]

Also, does the shutdown hook get called? (There is a minimal teardown sketch at the end of this message.)

On Tue, Jul 12, 2016 at 2:35 AM, Anton Sviridov <keyn...@gmail.com> wrote:

> Hi.
>
> Here are the last few lines before it starts removing broadcasts:
>
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003209_20886' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003209
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003209_20886: Committed
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3211.0 in stage 9.0 (TID 20888) in 95 ms on localhost (3209/3214)
> 16/07/11 14:02:11 INFO Executor: Finished task 3209.0 in stage 9.0 (TID 20886). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3209.0 in stage 9.0 (TID 20886) in 103 ms on localhost (3210/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003208_20885' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003208
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003208_20885: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3208.0 in stage 9.0 (TID 20885). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3208.0 in stage 9.0 (TID 20885) in 109 ms on localhost (3211/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003212_20889' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003212
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003212_20889: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3212.0 in stage 9.0 (TID 20889). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3212.0 in stage 9.0 (TID 20889) in 84 ms on localhost (3212/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003210_20887' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003210
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003210_20887: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3210.0 in stage 9.0 (TID 20887). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3210.0 in stage 9.0 (TID 20887) in 100 ms on localhost (3213/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003213_20890' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003213
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003213_20890: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3213.0 in stage 9.0 (TID 20890). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3213.0 in stage 9.0 (TID 20890) in 82 ms on localhost (3214/3214)
> 16/07/11 14:02:11 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
> *16/07/11 14:02:11 INFO DAGScheduler: ResultStage 9 (saveAsTextFile at SfCountsDumper.scala:13) finished in 42.294 s*
> *16/07/11 14:02:11 INFO DAGScheduler: Job 1 finished: saveAsTextFile at SfCountsDumper.scala:13, took 9517.124624 s*
> 16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.101.230.154:35192 in memory (size: 15.8 KB, free: 37.1 GB)
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 7
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 6
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 5
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 4
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 3
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 2
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 1
> 16/07/11 14:28:46 INFO BlockManager: Removing RDD 14
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned RDD 14
> 16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 10.101.230.154:35192 in memory (size: 25.5 KB, free: 37.1 GB)
> ...
>
> In fact, the job is still running: Spark's UI shows an uptime of 20.6 hours, with the last job having finished at least 18 hours ago.
>
> On Mon, 11 Jul 2016 at 23:23 dhruve ashar <dhruveas...@gmail.com> wrote:
>
>> Hi,
>>
>> Can you check the time when the job actually finished from the logs? The logs provided are too short and do not reveal meaningful information.
>>
>> On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime <keyn...@gmail.com> wrote:
>>
>>> Spark 2.0.0-preview
>>>
>>> We've got an app that uses a fairly big broadcast variable. We run this on a big EC2 instance, so deployment is in client mode. The broadcast variable is a massive Map[String, Array[String]].
>>>
>>> At the end of saveAsTextFile, the output in the folder seems to be complete and correct (apart from .crc files still being there), BUT the spark-submit process is stuck on, seemingly, removing the broadcast variable. The stuck logs look like this: http://pastebin.com/wpTqvArY
>>>
>>> My last run lasted for 12 hours after doing saveAsTextFile - just sitting there. I did a jstack on the driver process; most threads are parked: http://pastebin.com/E29JKVT7
>>>
>>> Full story: we used this code with Spark 1.5.0 and it worked, but then the data changed and something stopped fitting into Kryo's serialisation buffer. Increasing it didn't help, so I had to disable the KryoSerialiser. Tested it again - it hung. Switched to 2.0.0-preview - seems like the same issue.
>>>
>>> I'm not quite sure what's even going on, given that there's almost no CPU activity and no output in the logs, yet the output is not finalised like it used to be before.
>>>
>>> Would appreciate any help, thanks
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
>> -Dhruve Ashar

--
-Dhruve Ashar
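To make the questions above concrete: if the application creates its own scheduler or thread pool (the default Executors thread factory produces non-daemon threads) and the SparkContext is never stopped, the driver can sit in exactly this state after the last job finishes, with only the ContextCleaner ticking over in the background. Below is a minimal teardown sketch; all names here (the object, the app name, the scheduler, the broadcast payload and the job body) are placeholders, not taken from the original code:

    import java.util.concurrent.{Executors, TimeUnit}

    import org.apache.spark.{SparkConf, SparkContext}

    object JobWithExplicitTeardown {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sf-counts"))

        // Hypothetical user-created scheduler: the default thread factory
        // produces non-daemon threads, so the JVM cannot exit until this
        // pool is shut down explicitly.
        val scheduler = Executors.newScheduledThreadPool(1)

        try {
          val dictionary = sc.broadcast(Map("key" -> Array("value"))) // placeholder payload
          // ... build the RDDs, saveAsTextFile(...), etc. ...
          dictionary.destroy() // optional: release the broadcast once it is no longer needed
        } finally {
          scheduler.shutdownNow()
          scheduler.awaitTermination(30, TimeUnit.SECONDS)
          sc.stop() // lets the context and its background threads shut down cleanly
        }
      }
    }

The essential parts are shutting down every executor service the application created and calling sc.stop() explicitly; destroying the broadcast is optional and only releases its blocks earlier.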