Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly
large datasets (1+ billion input records). As I grow my dataset I often run
into issues with a lot of failed stages and dropped executors, ultimately
leading to the whole application failing. The errors look like
"org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 19" and
"org.apache.spark.shuffle.FetchFailedException: Failed to connect to...".
These occur during flatMap, mapPartitions, and aggregate stages. I know
that increasing memory fixes this issue, but most of the time my executors
are only using a tiny portion of their allocated memory (<10%). Often,
the stages run fine until the last iteration or two of ALS, but this could
just be a coincidence.
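
For reference, here's a minimal sketch of roughly how I'm setting up the job
(the input path and the rank/iterations/lambda/alpha values below are
placeholders, not my exact settings):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val conf = new SparkConf().setAppName("implicit-als")
val sc = new SparkContext(conf)

// Checkpointing can help cut the long shuffle lineage that builds up
// across ALS iterations.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

// ~1B+ (user, item, count) records, tab-separated (placeholder path/format)
val ratings = sc.textFile("hdfs:///data/interactions")
  .map(_.split('\t'))
  .map { case Array(user, item, count) =>
    Rating(user.toInt, item.toInt, count.toDouble)
  }

val model = ALS.trainImplicit(ratings, /* rank */ 50, /* iterations */ 10,
  /* lambda */ 0.01, /* alpha */ 40.0)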

I've tried tweaking a lot of settings, but it's time-consuming to do this
through guess-and-check. Right now I have these set:
spark.shuffle.memoryFraction = 0.3
spark.storage.memoryFraction = 0.65
spark.executor.heartbeatInterval = 600000
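
(Those are passed at submit time; the snippet below is just an equivalent
sketch of the same settings applied through SparkConf - the executor memory
and app name are placeholders.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("implicit-als")
  .set("spark.executor.memory", "34g")                  // ~240 GB over 7 executors
  .set("spark.shuffle.memoryFraction", "0.3")
  .set("spark.storage.memoryFraction", "0.65")
  .set("spark.executor.heartbeatInterval", "600000")    // milliseconds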

I'm sure these settings aren't optimal - any idea what could be causing
my errors, and in what direction I can push these settings to get more out
of my memory? I'm currently using 240 GB of memory (on 7 executors) for a
1-billion-record dataset, which seems like too much.
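
For a rough sense of how that memory gets carved up per executor with the
fractions above (ignoring Spark's internal safety fractions):

  240 GB / 7 executors         ~ 34 GB heap per executor
  storage: 0.65 * 34 GB        ~ 22 GB
  shuffle: 0.3 * 34 GB         ~ 10 GB
  left for everything else     ~ 2 GB

which makes the low observed memory usage even more confusing to me.

Thanks!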
