BTW - the reason the workaround could help is that when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk.
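For anyone trying it, here is a rough, self-contained sketch of that workaround. The s3n path is the placeholder from the thread, the partition count of 500 is purely illustrative, and it assumes you are building your own SparkContext rather than using the shell:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskOnlyWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disk-only-workaround"))

    // Read the (non-splittable) Snappy-compressed JSON files and spill each
    // partition straight to disk instead of unrolling it in memory first.
    val a = sc.textFile("s3n://some-path/*.json")
      .persist(StorageLevel.DISK_ONLY)   // pass partitions through to disk, no in-memory unroll
      .repartition(500)                  // illustrative value; pick something larger than the file count
      .cache()                           // cache the smaller re-partitioned blocks in memory

    println(a.count())
  }
}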
On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> It seems possible that you are running out of memory unrolling a single
> partition of the RDD. This is something that can cause your executor to
> OOM, especially if the cache is close to being full so the executor
> doesn't have much free memory left. How large are your executors? At the
> time of failure, is the cache already nearly full?
>
> I also believe the Snappy compression codec in Hadoop is not splittable.
> This means that each of your JSON files is read in its entirety as one
> Spark partition. If you have files that are larger than the standard block
> size (128MB), it will exacerbate this shortcoming of Spark. Incidentally,
> this means minPartitions won't help you at all here.
>
> This is fixed in the master branch and will be fixed in Spark 1.1. As a
> debugging step (if this is doable), it's worth running this job on the
> master branch and seeing if it succeeds.
>
> https://github.com/apache/spark/pull/1165
>
> A (potential) workaround would be to first persist your data to disk, then
> re-partition it, then cache it. I'm not 100% sure whether that will work
> though.
>
> val a = sc.textFile("s3n://some-path/*.json")
>   .persist(DISK_ONLY)
>   .repartition(/* larger number of partitions */)
>   .cache()
>
> - Patrick
>
> On Fri, Aug 1, 2014 at 10:17 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Isn't this your worker running out of its memory for computations,
>>> rather than for caching RDDs?
>>
>> I'm not sure how to interpret the stack trace, but let's say that's true.
>> I'm even seeing this with a simple a = sc.textFile().cache() and then
>> a.count(). Spark shouldn't need that much memory for this kind of work,
>> no?
>>
>>> then the answer is that you should tell
>>> it to use less memory for caching.
>>
>> I can try that. That's done by changing spark.storage.memoryFraction,
>> right?
>>
>> This still seems strange though. The default fraction of the JVM left for
>> non-cache activity (1 - 0.6 = 40%
>> <http://spark.apache.org/docs/latest/configuration.html#execution-behavior>)
>> should be plenty for just counting elements. I'm using m1.xlarge nodes
>> that have 15GB of memory apiece.
>>
>> Nick
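As a footnote on the config Nick mentions above: in Spark 1.x the cache fraction is spark.storage.memoryFraction (default 0.6), and lowering it would look roughly like this sketch. The 0.4 value and app name are just examples, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Leave more of the executor heap for computation (e.g. unrolling partitions)
// by shrinking the fraction reserved for cached RDD blocks (default 0.6).
val conf = new SparkConf()
  .setAppName("smaller-cache-fraction")
  .set("spark.storage.memoryFraction", "0.4")   // example value; tune for your workload

val sc = new SparkContext(conf)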