Looking at the jstack, it seems that it doesn't contain all the threads; I can't find the main thread in it.
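If it helps to narrow things down, one way to see exactly which threads are still keeping the driver JVM alive is to dump them from inside the application right before main returns, rather than relying on jstack alone. A minimal sketch in plain Scala (nothing Spark-specific; the object name is just for illustration):

    import scala.collection.JavaConverters._

    object ThreadReport {
      /** Prints every live JVM thread with its daemon flag and state.
        * Any non-daemon thread (other than "main") still alive after the
        * last job finishes can prevent the driver process from exiting. */
      def dumpLiveThreads(): Unit = {
        Thread.getAllStackTraces.keySet.asScala.toSeq
          .sortBy(_.getName)
          .foreach { t =>
            println(f"${t.getName}%-40s daemon=${t.isDaemon} state=${t.getState}")
          }
      }
    }

Calling ThreadReport.dumpLiveThreads() as the last line of the driver program and looking for daemon=false entries should show what is blocking the shutdown.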
I am not an expert on analyzing jstacks, but are you creating any threads in your code? Are you shutting them down correctly? This one is a non-daemon thread and doesn't seem to be coming from Spark:

*"Scheduler-2144644334"* #110 prio=5 os_prio=0 tid=0x00007f8104001800 nid=0x715 waiting on condition [0x00007f812cf95000]

Also, does the shutdown hook get called? (There is a minimal teardown sketch at the end of this message.)

On Tue, Jul 12, 2016 at 2:35 AM, Anton Sviridov <keyn...@gmail.com> wrote:

> Hi.
>
> Here are the last few lines before it starts removing broadcasts:
>
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003209_20886' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003209
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003209_20886: Committed
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3211.0 in stage 9.0 (TID 20888) in 95 ms on localhost (3209/3214)
> 16/07/11 14:02:11 INFO Executor: Finished task 3209.0 in stage 9.0 (TID 20886). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3209.0 in stage 9.0 (TID 20886) in 103 ms on localhost (3210/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003208_20885' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003208
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003208_20885: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3208.0 in stage 9.0 (TID 20885). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3208.0 in stage 9.0 (TID 20885) in 109 ms on localhost (3211/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003212_20889' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003212
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003212_20889: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3212.0 in stage 9.0 (TID 20889). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3212.0 in stage 9.0 (TID 20889) in 84 ms on localhost (3212/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003210_20887' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003210
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003210_20887: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3210.0 in stage 9.0 (TID 20887). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3210.0 in stage 9.0 (TID 20887) in 100 ms on localhost (3213/3214)
> 16/07/11 14:02:11 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
> 16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003213_20890' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003213
> 16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003213_20890: Committed
> 16/07/11 14:02:11 INFO Executor: Finished task 3213.0 in stage 9.0 (TID 20890). 1721 bytes result sent to driver
> 16/07/11 14:02:11 INFO TaskSetManager: Finished task 3213.0 in stage 9.0 (TID 20890) in 82 ms on localhost (3214/3214)
> 16/07/11 14:02:11 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
> *16/07/11 14:02:11 INFO DAGScheduler: ResultStage 9 (saveAsTextFile at SfCountsDumper.scala:13) finished in 42.294 s*
> *16/07/11 14:02:11 INFO DAGScheduler: Job 1 finished: saveAsTextFile at SfCountsDumper.scala:13, took 9517.124624 s*
> 16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.101.230.154:35192 in memory (size: 15.8 KB, free: 37.1 GB)
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 7
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 6
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 5
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 4
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 3
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 2
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 1
> 16/07/11 14:28:46 INFO BlockManager: Removing RDD 14
> 16/07/11 14:28:46 INFO ContextCleaner: Cleaned RDD 14
> 16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 10.101.230.154:35192 in memory (size: 25.5 KB, free: 37.1 GB)
> ...
>
> In fact, the job is still running: Spark's UI shows an uptime of 20.6 hours, with the last job having finished at least 18 hours ago.
>
> On Mon, 11 Jul 2016 at 23:23 dhruve ashar <dhruveas...@gmail.com> wrote:
>
>> Hi,
>>
>> Can you check the time when the job actually finished from the logs? The logs provided are too short and do not reveal meaningful information.
>>
>> On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime <keyn...@gmail.com> wrote:
>>
>>> Spark 2.0.0-preview
>>>
>>> We've got an app that uses a fairly big broadcast variable. We run this on a big EC2 instance, so deployment is in client mode. The broadcast variable is a massive Map[String, Array[String]].
>>>
>>> At the end of saveAsTextFile, the output in the folder seems to be complete and correct (apart from .crc files still being there), BUT the spark-submit process is stuck on, seemingly, removing the broadcast variable. The stuck logs look like this: http://pastebin.com/wpTqvArY
>>>
>>> My last run lasted for 12 hours after doing saveAsTextFile - just sitting there. I did a jstack on the driver process; most threads are parked: http://pastebin.com/E29JKVT7
>>>
>>> Full story: we used this code with Spark 1.5.0 and it worked, but then the data changed and something stopped fitting into Kryo's serialisation buffer. Increasing it didn't help, so I had to disable the KryoSerialiser. Tested it again - it hung. Switched to 2.0.0-preview - seems like the same issue.
>>>
>>> I'm not quite sure what's even going on, given that there's almost no CPU activity and no output in the logs, yet the output is not finalised like it used to be before.
>>>
>>> Would appreciate any help, thanks
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
>> -Dhruve Ashar

--
-Dhruve Ashar
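To make the questions above concrete: if the application creates its own scheduler or thread pool (the default Executors thread factory produces non-daemon threads) and the SparkContext is never stopped, the driver can sit in exactly this state after the last job finishes, with only the ContextCleaner ticking over in the background. Below is a minimal teardown sketch; all names here (the object, the app name, the scheduler, the broadcast payload and the job body) are placeholders, not taken from the original code:

    import java.util.concurrent.{Executors, TimeUnit}

    import org.apache.spark.{SparkConf, SparkContext}

    object JobWithExplicitTeardown {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sf-counts"))

        // Hypothetical user-created scheduler: the default thread factory
        // produces non-daemon threads, so the JVM cannot exit until this
        // pool is shut down explicitly.
        val scheduler = Executors.newScheduledThreadPool(1)

        try {
          val dictionary = sc.broadcast(Map("key" -> Array("value"))) // placeholder payload
          // ... build the RDDs, saveAsTextFile(...), etc. ...
          dictionary.destroy() // optional: release the broadcast once it is no longer needed
        } finally {
          scheduler.shutdownNow()
          scheduler.awaitTermination(30, TimeUnit.SECONDS)
          sc.stop() // lets the context and its background threads shut down cleanly
        }
      }
    }

The essential parts are shutting down every executor service the application created and calling sc.stop() explicitly; destroying the broadcast is optional and only releases its blocks earlier.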