Where do I do that? Thanks

Sent from my iPhone
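For reference, the overhead suggested below is the standard spark.yarn.executor.memoryOverhead property (its value is in MB). A minimal sketch of the two usual places to set it, with 1536 standing in for the ~1.5g suggested and df4test.py taken from the traceback further down:

    # Per job, on the spark-submit command line:
    spark-submit --master yarn --conf spark.yarn.executor.memoryOverhead=1536 df4test.py

    # Or for every job, in conf/spark-defaults.conf:
    spark.yarn.executor.memoryOverhead   1536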
> On Jun 27, 2015, at 8:59 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>
> Try setting the YARN executor memory overhead to a higher value, like 1g or 1.5g or more.
>
> Regards
> Sab
>
>> On 28-Jun-2015 9:22 am, "Ayman Farahat" <ayman.fara...@yahoo.com> wrote:
>> That's correct, this is YARN and Spark 1.4.
>> Also using the Anaconda tar for NumPy and other libs.
>>
>> Sent from my iPhone
>>
>>> On Jun 27, 2015, at 8:50 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>>
>>> Are you running on top of YARN? Also, please provide your infrastructure details.
>>>
>>> Regards
>>> Sab
>>>
>>>> On 28-Jun-2015 8:47 am, "Ayman Farahat" <ayman.fara...@yahoo.com.invalid> wrote:
>>>> Hello;
>>>> I tried to adjust the number of blocks by repartitioning the input.
>>>> Here is how I do it (I am partitioning by users):
>>>>
>>>> tot = newrdd.map(lambda l:
>>>>     (l[1], Rating(int(l[1]), int(l[2]), l[4]))).partitionBy(50).cache()
>>>> ratings = tot.values()
>>>> numIterations = 8
>>>> rank = 80
>>>> model = ALS.trainImplicit(ratings, rank, numIterations)
>>>>
>>>> I have 20 executors with 5 GB of memory per executor.
>>>> When I use 80 factors I keep getting the following problem:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "/homes/afarahat/myspark/mm/df4test.py", line 85, in <module>
>>>>     model = ALS.trainImplicit(ratings, rank, numIterations)
>>>>   File "/homes/afarahat/aofspark/share/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 201, in trainImplicit
>>>>   File "/homes/afarahat/aofspark/share/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 128, in callMLlibFunc
>>>>   File "/homes/afarahat/aofspark/share/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 121, in callJavaFunc
>>>>   File "/homes/afarahat/aofspark/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>>>>   File "/homes/afarahat/aofspark/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
>>>> py4j.protocol.Py4JJavaError: An error occurred while calling o113.trainImplicitALSModel.
>>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 36.1 failed 4 times,
>>>> most recent failure: Lost task 7.3 in stage 36.1 (TID 1841, gsbl52746.blue.ygrid.yahoo.com):
>>>> java.io.FileNotFoundException: /grid/3/tmp/yarn-local/usercache/afarahat/appcache/application_1433921068880_1027774/blockmgr-0e518470-57d8-472f-8fba-3b593e4dda42/27/rdd_56_24 (No such file or directory)
>>>>         at java.io.RandomAccessFile.open(Native Method)
>>>>         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
>>>>         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:110)
>>>>         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
>>>>         at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)
>>>>         at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:429)
>>>>         at org.apache.spark.storage.BlockManager.get(BlockManager.scala:617)
>>>>         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
>>>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>         at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>         at java.lang.Thread.run(Thread.java:722)
>>>>
>>>> Driver stacktrace:
>>>>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
>>>>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>>>>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>>>>         at scala.Option.foreach(Option.scala:236)
>>>>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>>>>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
>>>>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
>>>>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>
>>>> Jun 28, 2015 2:10:37 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
>>>>
>>>>> On Jun 26, 2015, at 12:33 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>
>>>>> So you have 100 partitions (blocks). This might be too many for your dataset.
>>>>> Try setting a smaller number of blocks, e.g., 32 or 64. When ALS starts
>>>>> iterations, you can see the shuffle read/write size from the "stages" tab of
>>>>> the Spark WebUI. Vary the number of blocks and check the numbers there. The
>>>>> Kryo serializer doesn't help much here. You can try disabling it (though I
>>>>> don't think it caused the failure). -Xiangrui
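For concreteness, a minimal sketch of the smaller-block variant suggested above, assuming the MLlib 1.4 Python API in which ALS.trainImplicit accepts a blocks argument; 64 is just one of the values mentioned, and newrdd is the raw input RDD from the snippet quoted earlier:

    from pyspark.mllib.recommendation import ALS, Rating

    # Sketch only: use fewer ALS blocks and keep the partition count in line with them.
    numBlocks = 64   # illustrative; blocks=-1 (the default) lets MLlib choose
    tot = (newrdd
           .map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), l[4])))
           .partitionBy(numBlocks)
           .cache())
    ratings = tot.values()
    model = ALS.trainImplicit(ratings, rank=80, iterations=8, blocks=numBlocks)

With fewer, larger blocks the shuffle read/write sizes on the "stages" tab should change accordingly, which is the number to watch while varying blocks.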
>>>>>
>>>>>> On Fri, Jun 26, 2015 at 11:00 AM, Ayman Farahat <ayman.fara...@yahoo.com> wrote:
>>>>>> Hello;
>>>>>> I checked on my partitions/storage and here is what I have.
>>>>>>
>>>>>> I have 80 executors with 5 GB per executor.
>>>>>>
>>>>>> Do I need to set additional params, say cores?
>>>>>>
>>>>>> spark.serializer                   org.apache.spark.serializer.KryoSerializer
>>>>>> # spark.driver.memory              5g
>>>>>> # spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
>>>>>> spark.shuffle.memoryFraction       0.3
>>>>>> spark.storage.memoryFraction       0.65
>>>>>>
>>>>>> RDD Name       Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
>>>>>> ratingBlocks   Memory Deserialized 1x Replicated  257                129%             4.1 GB          0.0 B            0.0 B
>>>>>> itemOutBlocks  Memory Deserialized 1x Replicated  100                100%             7.3 MB          0.0 B            0.0 B
>>>>>> 38             Memory Serialized 1x Replicated    193                97%              5.6 GB          0.0 B            0.0 B
>>>>>> userInBlocks   Memory Deserialized 1x Replicated  100                100%             2.8 GB          0.0 B            0.0 B
>>>>>> itemFactors-1  Memory Deserialized 1x Replicated  69                 69%              8.4 MB          0.0 B            0.0 B
>>>>>> itemInBlocks   Memory Deserialized 1x Replicated  69                 69%              1455.3 MB       0.0 B            0.0 B
>>>>>> userFactors-1  Memory Deserialized 1x Replicated  100                100%             35.0 GB         0.0 B            0.0 B
>>>>>> userOutBlocks  Memory Deserialized 1x Replicated  100                100%             1062.7 MB       0.0 B            0.0 B
>>>>>>
>>>>>>> On Jun 26, 2015, at 8:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>
>>>>>>> number of CPU cores or less.
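As for the question about cores above: executor count, memory, and cores are normally passed on the spark-submit command line on YARN. A hedged sketch matching the 80 executors with 5 GB each described above (the --executor-cores value is purely illustrative, since the thread never settles on one):

    spark-submit --master yarn \
      --num-executors 80 \
      --executor-memory 5g \
      --executor-cores 4 \
      df4test.py

And to try the suggestion of disabling Kryo, it should be enough to comment out the spark.serializer line above, since Java serialization is Spark's default.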