Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly large datasets (1+ billion input records). As I grow the dataset I often run into a lot of failed stages and dropped executors, which ultimately lead to the whole application failing. The errors look like "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException: Failed to connect to...", and they occur during flatMap, mapPartitions, and aggregate stages. I know that increasing memory fixes the issue, but most of the time my executors are only using a tiny portion of their allocated memory (<10%). Often the stages run fine until the last iteration or two of ALS, but that could just be a coincidence.
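For context, the job is roughly shaped like the sketch below (simplified; the input path, rank, iteration count, lambda, and alpha here are placeholders, not my exact values):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Rough shape of the job -- path, rank, iterations, lambda, and alpha
    // below are placeholders, not the exact values I'm running with.
    val sc = new SparkContext(new SparkConf().setAppName("implicit-als"))

    val ratings = sc.textFile("hdfs:///path/to/interactions")
      .map(_.split('\t'))
      .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))

    // The shuffle-heavy flatMap/mapPartitions/aggregate stages that fail
    // all come from the ALS iterations inside trainImplicit.
    val model = ALS.trainImplicit(ratings, 50 /* rank */, 10 /* iterations */,
      0.01 /* lambda */, 40.0 /* alpha */)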
I've tried tweaking a lot of settings, but doing this by guess-and-check is time-consuming. Right now I have these set:

spark.shuffle.memoryFraction = 0.3
spark.storage.memoryFraction = 0.65
spark.executor.heartbeatInterval = 600000

I'm sure these settings aren't optimal - any idea what could be causing these errors, and in which direction I should push these settings to make better use of the memory? I'm currently using 240 GB of memory (across 7 executors) for a 1-billion-record dataset, which seems like too much. Thanks!
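For completeness, this is how I'm applying those settings on the SparkConf (equivalently they could live in spark-defaults.conf); everything else is left at the Spark 1.3.1 defaults:

    import org.apache.spark.SparkConf

    // Current non-default settings, exactly as listed above.
    val conf = new SparkConf()
      .setAppName("implicit-als")
      .set("spark.shuffle.memoryFraction", "0.3")
      .set("spark.storage.memoryFraction", "0.65")
      .set("spark.executor.heartbeatInterval", "600000")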