It shouldn't be hard to handle 1 billion ratings in Spark 1.3. We just need more information to guess what happened:
1. Could you share the ALS settings, e.g., number of blocks, rank, and number of iterations, as well as the number of users/items in your dataset?
2. If you monitor the progress in the WebUI, how much data is stored in memory and how much data is shuffled per iteration?
3. Do you have enough disk space for the shuffle files?
4. Did you set checkpointDir in SparkContext and checkpointInterval in ALS?

Best,
Xiangrui

On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly
> large datasets (1+ billion input records). As I grow my dataset I often run
> into issues with a lot of failed stages and dropped executors, ultimately
> leading to the whole application failing. The errors look like
> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
> location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException:
> Failed to connect to...". These occur during flatMap, mapPartitions, and
> aggregate stages. I know that increasing memory fixes this issue, but most
> of the time my executors are only using a tiny portion of their
> allocated memory (<10%). Often, the stages run fine until the last iteration
> or two of ALS, but this could just be a coincidence.
>
> I've tried tweaking a lot of settings, but it's time-consuming to do this
> through guess-and-check. Right now I have these set:
> spark.shuffle.memoryFraction = 0.3
> spark.storage.memoryFraction = 0.65
> spark.executor.heartbeatInterval = 600000
>
> I'm sure these settings aren't optimal - any idea what could be causing
> my errors, and which direction I can push these settings to get more out
> of my memory? I'm currently using 240 GB of memory (on 7 executors) for a
> 1-billion-record dataset, which seems like too much. Thanks!
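For reference, here is a rough budget of one executor's heap under the settings quoted above, using the Spark 1.x legacy memory model where usable storage memory is capped at heap * spark.storage.memoryFraction * spark.storage.safetyFraction (default 0.9) and shuffle memory at heap * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction (default 0.8). This is only a sketch: the ~34 GB per-executor heap is an assumed even split of 240 GB across 7 executors, and the fractions are caps, not hard reservations.

```python
# Rough per-executor heap budget for Spark 1.3 under the quoted settings.
# Assumption: 240 GB split evenly across 7 executors (~34.3 GB each).
heap_gb = 240 / 7

storage_fraction = 0.65  # spark.storage.memoryFraction (from the email)
shuffle_fraction = 0.3   # spark.shuffle.memoryFraction (from the email)
storage_safety = 0.9     # spark.storage.safetyFraction default
shuffle_safety = 0.8     # spark.shuffle.safetyFraction default

storage_gb = heap_gb * storage_fraction * storage_safety
shuffle_gb = heap_gb * shuffle_fraction * shuffle_safety
# Everything else (task objects, user code, deserialization buffers)
# has to fit in the slice not claimed by the two caps:
other_gb = heap_gb * (1 - storage_fraction - shuffle_fraction)

print(f"storage region: {storage_gb:.1f} GB")  # ~20.1 GB
print(f"shuffle region: {shuffle_gb:.1f} GB")  # ~8.2 GB
print(f"unreserved:     {other_gb:.1f} GB")    # ~1.7 GB
```

With the two fractions summing to 0.95, only about 5% of each heap is left over for task-level working memory, which is one way an executor can die with shuffle fetch failures even though the storage usage shown in the UI looks low.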