Of course, breakpointing on every status update and every revive-offers invocation kept the problem from happening. Where could the race be?
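To make sure I'm reasoning about the right shape of loop, here's a toy model of how I understand local-mode scheduling: launching work only ever happens inside reviveOffers(), and reviveOffers() is only triggered by an explicit ReviveOffers message or by a finished-task StatusUpdate. This is just a sketch I wrote to think through the race; it is not the real LocalBackend code, and all of the names and numbers in it are made up.

import java.util.concurrent.{Executors, LinkedBlockingQueue}

// Toy model only; not the actual Spark code, just a simplified single-threaded loop.
object LocalLoopSketch {
  sealed trait Msg
  case object ReviveOffers extends Msg
  final case class StatusUpdate(taskId: Long, finished: Boolean) extends Msg

  private val inbox = new LinkedBlockingQueue[Msg]()
  private var freeCores = 8                     // as if running local[8]
  private var pending   = (1L to 222L).toList   // tasks waiting to be launched

  def send(m: Msg): Unit = inbox.put(m)

  // Hand free cores to pending tasks. Only ever called from the two cases below.
  private def reviveOffers(): Unit =
    while (freeCores > 0 && pending.nonEmpty) {
      val t = pending.head
      pending = pending.tail
      freeCores -= 1
      println(s"launching task $t")
      // In the real system the task runs asynchronously and eventually
      // reports completion back via a StatusUpdate message.
    }

  def main(args: Array[String]): Unit = {
    Executors.newSingleThreadExecutor().submit(new Runnable {
      def run(): Unit = while (true) inbox.take() match {
        case ReviveOffers => reviveOffers()
        case StatusUpdate(_, finished) =>
          if (finished) {
            freeCores += 1
            // If this revive is ever skipped, or the message never arrives,
            // while tasks are still pending, nothing will launch them again.
            reviveOffers()
          }
      }
    })
    send(ReviveOffers)                          // initial kick from job submission
    send(StatusUpdate(1L, finished = true))     // a task reporting back
  }
}

If the real loop has this shape, the suspects would be a finished task whose status update never triggers a revive, or a revive that runs before the bookkeeping it depends on (free cores, pending tasks) has been updated, which would also explain why breakpoints make the hang disappear.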
On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen <v...@paxata.com> wrote:

> Love to hear some input on this. I did get a standalone cluster up on my
> local machine and the problem didn't present itself. I'm pretty confident
> that means the problem is in the LocalBackend or something near it.
>
> On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen <v...@paxata.com> wrote:
>
>> Okay, I confirmed my suspicions of a hang. I made a request that stopped
>> progressing, though the already-scheduled tasks had finished. I made a
>> separate request that was small enough not to hang, and it kicked the hung
>> job enough to finish. I think what's happening is that the scheduler or the
>> local backend is not kicking the revive-offers messaging at the right time,
>> but I have to dig into the code some more to nail the culprit. Does anyone
>> on this list have experience in those code areas who could help?
>>
>> On Thu, Feb 26, 2015 at 2:27 AM, Victor Tso-Guillen <v...@paxata.com> wrote:
>>
>>> Thanks for the link. Unfortunately, I turned on RDD compression and
>>> nothing changed. I tried moving netty -> nio and no change :(
>>>
>>> On Thu, Feb 26, 2015 at 2:01 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>
>>>> Not many that I know of, but I bumped into this one:
>>>> https://issues.apache.org/jira/browse/SPARK-4516
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Thu, Feb 26, 2015 at 3:26 PM, Victor Tso-Guillen <v...@paxata.com> wrote:
>>>>
>>>>> Is there any potential problem from 1.1.1 to 1.2.1 with shuffle
>>>>> dependencies that produce no data?
>>>>>
>>>>> On Thu, Feb 26, 2015 at 1:56 AM, Victor Tso-Guillen <v...@paxata.com> wrote:
>>>>>
>>>>>> The data is small. The job is composed of many small stages.
>>>>>>
>>>>>> * I found that the problem shows up even with fewer than 222
>>>>>>   partitions. What would be gained by going higher?
>>>>>> * Pushing up the parallelism only pushes up the boundary at which the
>>>>>>   system appears to hang. I'm worried about some sort of message loss
>>>>>>   or inconsistency.
>>>>>> * Yes, we are using Kryo.
>>>>>> * I'll try that, but I'm again a little confused about why you're
>>>>>>   recommending it. I'm stumped, so might as well?
>>>>>>
>>>>>> On Wed, Feb 25, 2015 at 11:13 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>>>>
>>>>>>> What operation are you trying to do, and how big is the data that
>>>>>>> you are operating on?
>>>>>>>
>>>>>>> Here are a few things you can try:
>>>>>>>
>>>>>>> - Repartition the RDD to a higher number than 222
>>>>>>> - Specify the master as local[*] or local[10]
>>>>>>> - Use the Kryo serializer (.set("spark.serializer",
>>>>>>>   "org.apache.spark.serializer.KryoSerializer"))
>>>>>>> - Enable RDD compression (.set("spark.rdd.compress", "true"))
>>>>>>>
>>>>>>> Thanks
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Thu, Feb 26, 2015 at 10:15 AM, Victor Tso-Guillen <v...@paxata.com> wrote:
>>>>>>>
>>>>>>>> I'm getting this really reliably on Spark 1.2.1. Basically I'm in
>>>>>>>> local mode with parallelism at 8. I have 222 tasks and I never seem
>>>>>>>> to get far past 40. Usually in the 20s to 30s it will just hang. The
>>>>>>>> last logging is below, along with a screenshot of the UI.
>>>>>>>>
>>>>>>>> 2015-02-25 20:39:55.779 GMT-0800 INFO [task-result-getter-3] TaskSetManager - Finished task 3.0 in stage 16.0 (TID 22) in 612 ms on localhost (1/5)
>>>>>>>> 2015-02-25 20:39:55.825 GMT-0800 INFO [Executor task launch worker-10] Executor - Finished task 1.0 in stage 16.0 (TID 20). 2492 bytes result sent to driver
>>>>>>>> 2015-02-25 20:39:55.825 GMT-0800 INFO [Executor task launch worker-8] Executor - Finished task 2.0 in stage 16.0 (TID 21). 2492 bytes result sent to driver
>>>>>>>> 2015-02-25 20:39:55.831 GMT-0800 INFO [task-result-getter-0] TaskSetManager - Finished task 1.0 in stage 16.0 (TID 20) in 670 ms on localhost (2/5)
>>>>>>>> 2015-02-25 20:39:55.836 GMT-0800 INFO [task-result-getter-1] TaskSetManager - Finished task 2.0 in stage 16.0 (TID 21) in 674 ms on localhost (3/5)
>>>>>>>> 2015-02-25 20:39:55.891 GMT-0800 INFO [Executor task launch worker-9] Executor - Finished task 0.0 in stage 16.0 (TID 19). 2492 bytes result sent to driver
>>>>>>>> 2015-02-25 20:39:55.896 GMT-0800 INFO [task-result-getter-2] TaskSetManager - Finished task 0.0 in stage 16.0 (TID 19) in 740 ms on localhost (4/5)
>>>>>>>>
>>>>>>>> [image: Inline image 1]
>>>>>>>>
>>>>>>>> What should I make of this? Where do I start?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Victor
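P.S. For anyone who lands on this thread later: the tweaks Akhil suggested above are mostly one-liners on the conf. A rough, spark-shell-style sketch of applying them together; the master, app name, input path, and partition count below are placeholders, not values from the actual job:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the suggested settings applied together; all values are placeholders.
val conf = new SparkConf()
  .setMaster("local[*]")     // or local[10]
  .setAppName("hang-repro")  // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")

val sc = new SparkContext(conf)

// Repartition to something higher than 222 before the heavy stages.
val repartitioned = sc.textFile("input.txt").repartition(400)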