It shouldn't be hard to handle 1 billion ratings in Spark 1.3. We just need more information to guess what happened:
1. Could you share the ALS settings, e.g., number of blocks, rank, and number of iterations, as well as the number of users/items in your dataset?
2. If you monitor the progress in the WebUI, how much data is stored in memory and how much data is shuffled per iteration?
3. Do you have enough disk space for the shuffle files?
4. Did you set checkpointDir in SparkContext and checkpointInterval in ALS?

Best,
Xiangrui

On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly
> large datasets (1+ billion input records). As I grow my dataset I often run
> into issues with a lot of failed stages and dropped executors, ultimately
> leading to the whole application failing. The errors look like
> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
> location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException:
> Failed to connect to...". These occur during flatMap, mapPartitions, and
> aggregate stages. I know that increasing memory fixes this issue, but most
> of the time my executors are only using a tiny portion of their
> allocated memory (<10%). Often, the stages run fine until the last iteration
> or two of ALS, but this could just be a coincidence.
>
> I've tried tweaking a lot of settings, but it's time-consuming to do this
> through guess-and-check. Right now I have these set:
> spark.shuffle.memoryFraction = 0.3
> spark.storage.memoryFraction = 0.65
> spark.executor.heartbeatInterval = 600000
>
> I'm sure these settings aren't optimal - any idea what could be causing
> my errors, and which direction I can push these settings to get more out
> of my memory? I'm currently using 240 GB of memory (on 7 executors) for a
> 1-billion-record dataset, which seems like too much. Thanks!
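For reference, here is a rough budget of one executor's heap under the settings quoted above, using the Spark 1.x legacy memory model where usable storage memory is capped at heap * spark.storage.memoryFraction * spark.storage.safetyFraction (default 0.9) and shuffle memory at heap * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction (default 0.8). This is only a sketch: the ~34 GB per-executor heap is an assumed even split of 240 GB across 7 executors, and the fractions are caps, not hard reservations.

```python
# Rough per-executor heap budget for Spark 1.3 under the quoted settings.
# Assumption: 240 GB split evenly across 7 executors (~34.3 GB each).
heap_gb = 240 / 7

storage_fraction = 0.65  # spark.storage.memoryFraction (from the email)
shuffle_fraction = 0.3   # spark.shuffle.memoryFraction (from the email)
storage_safety = 0.9     # spark.storage.safetyFraction default
shuffle_safety = 0.8     # spark.shuffle.safetyFraction default

storage_gb = heap_gb * storage_fraction * storage_safety
shuffle_gb = heap_gb * shuffle_fraction * shuffle_safety
# Everything else (task objects, user code, deserialization buffers)
# has to fit in the slice not claimed by the two caps:
other_gb = heap_gb * (1 - storage_fraction - shuffle_fraction)

print(f"storage region: {storage_gb:.1f} GB")  # ~20.1 GB
print(f"shuffle region: {shuffle_gb:.1f} GB")  # ~8.2 GB
print(f"unreserved:     {other_gb:.1f} GB")    # ~1.7 GB
```

With the two fractions summing to 0.95, only about 5% of each heap is left over for task-level working memory, which is one way an executor can die with shuffle fetch failures even though the storage usage shown in the UI looks low.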