Are you running on top of YARN? Please also provide your infrastructure
details.

Regards
Sab
On 28-Jun-2015 8:47 am, "Ayman Farahat" <ayman.fara...@yahoo.com.invalid>
wrote:

> Hello;
> I tried to adjust the number of blocks by repartitioning the input.
> Here is how I do it (I am partitioning by users):
>
> from pyspark.mllib.recommendation import ALS, Rating
>
> # key by user id, build Rating objects, split into 50 blocks, and cache
> tot = newrdd.map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), l[4]))) \
>             .partitionBy(50).cache()
> ratings = tot.values()
> numIterations = 8
> rank = 80
> model = ALS.trainImplicit(ratings, rank, numIterations)
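For what it's worth, a quick sanity check of how the input actually ended up partitioned before training (a hypothetical check, not part of the original job):

    # number of blocks produced by partitionBy(50) -- should print 50
    print(tot.getNumPartitions())
    # rough record count per block, to spot badly skewed partitions
    print(tot.mapPartitions(lambda it: [sum(1 for _ in it)]).collect())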
>
>
> I have 20 executors
> with 5 GB of memory per executor.
> When I use 80 factors I keep getting the following problem:
>
> Traceback (most recent call last):
>   File "/homes/afarahat/myspark/mm/df4test.py", line 85, in <module>
>     model = ALS.trainImplicit(ratings, rank, numIterations)
>   File "/homes/afarahat/aofspark/share/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 201, in trainImplicit
>   File "/homes/afarahat/aofspark/share/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 128, in callMLlibFunc
>   File "/homes/afarahat/aofspark/share/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 121, in callJavaFunc
>   File "/homes/afarahat/aofspark/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/homes/afarahat/aofspark/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o113.trainImplicitALSModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 36.1 failed 4 times, most recent failure: Lost task 7.3 in stage 36.1 (TID 1841, gsbl52746.blue.ygrid.yahoo.com): java.io.FileNotFoundException: /grid/3/tmp/yarn-local/usercache/afarahat/appcache/application_1433921068880_1027774/blockmgr-0e518470-57d8-472f-8fba-3b593e4dda42/27/rdd_56_24 (No such file or directory)
>         at java.io.RandomAccessFile.open(Native Method)
>         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
>         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:110)
>         at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
>         at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)
>         at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:429)
>         at org.apache.spark.storage.BlockManager.get(BlockManager.scala:617)
>         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:70)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
>
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> Jun 28, 2015 2:10:37 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
>
> On Jun 26, 2015, at 12:33 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
> So you have 100 partitions (blocks). This might be too many for your
> dataset. Try setting a smaller number of blocks, e.g., 32 or 64. When ALS
> starts iterations, you can see the shuffle read/write size from the
> "stages" tab of the Spark WebUI. Vary the number of blocks and check the
> numbers there. Kryo serializer doesn't help much here. You can try
> disabling it (though I don't think it caused the failure). -Xiangrui
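A minimal sketch of the first suggestion, assuming the same ratings RDD from the script above and the pyspark.mllib API (blocks=64 is just an illustrative value):

    from pyspark.mllib.recommendation import ALS

    # Let ALS build the user/product blocks itself instead of pre-partitioning;
    # fewer blocks means fewer, larger shuffle pieces per iteration.
    model = ALS.trainImplicit(ratings, rank=80, iterations=8, blocks=64)

Disabling Kryo just means removing (or commenting out) the spark.serializer line so Spark falls back to its default Java serializer.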
>
> On Fri, Jun 26, 2015 at 11:00 AM, Ayman Farahat <ayman.fara...@yahoo.com>
> wrote:
>
>> Hello;
>> I checked on my partitions/storage and here is what I have.
>>
>> I have 80 executors
>> with 5 GB per executor.
>>
>> Do I need to set additional params,
>> say cores?
>>
>> spark.serializer                 org.apache.spark.serializer.KryoSerializer
>> # spark.driver.memory              5g
>> # spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
>> spark.shuffle.memoryFraction  0.3
>> spark.storage.memoryFraction  0.65
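Executor cores can be set at submit time (or via spark.executor.cores); a minimal spark-submit sketch for YARN, with purely illustrative numbers:

    spark-submit \
      --master yarn \
      --num-executors 80 \
      --executor-cores 4 \
      --executor-memory 5g \
      df4test.py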
>>
>>
>>
>> RDD Name       (id)  Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in Tachyon  Size on Disk
>> ratingBlocks   (44)  Memory Deserialized 1x Replicated  257                129%             4.1 GB          0.0 B            0.0 B
>> itemOutBlocks  (53)  Memory Deserialized 1x Replicated  100                100%             7.3 MB          0.0 B            0.0 B
>> 38             (38)  Memory Serialized 1x Replicated    193                97%              5.6 GB          0.0 B            0.0 B
>> userInBlocks   (47)  Memory Deserialized 1x Replicated  100                100%             2.8 GB          0.0 B            0.0 B
>> itemFactors-1  (65)  Memory Deserialized 1x Replicated  69                 69%              8.4 MB          0.0 B            0.0 B
>> itemInBlocks   (52)  Memory Deserialized 1x Replicated  69                 69%              1455.3 MB       0.0 B            0.0 B
>> userFactors-1  (54)  Memory Deserialized 1x Replicated  100                100%             35.0 GB         0.0 B            0.0 B
>> userOutBlocks  (48)  Memory Deserialized 1x Replicated  100                100%             1062.7 MB       0.0 B            0.0 B
>>
>> On Jun 26, 2015, at 8:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>  number of CPU cores or less.
>>
>>
>>
>
>
