BTW - the reason the workaround could help is that when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk.
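For anyone trying it, here is a rough, self-contained sketch of that workaround. The s3n path is the placeholder from the thread, the partition count of 500 is purely illustrative, and it assumes you are building your own SparkContext rather than using the shell:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskOnlyWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disk-only-workaround"))

    // Read the (non-splittable) Snappy-compressed JSON files and spill each
    // partition straight to disk instead of unrolling it in memory first.
    val a = sc.textFile("s3n://some-path/*.json")
      .persist(StorageLevel.DISK_ONLY)   // pass partitions through to disk, no in-memory unroll
      .repartition(500)                  // illustrative value; pick something larger than the file count
      .cache()                           // cache the smaller re-partitioned blocks in memory

    println(a.count())
  }
}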
On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> It seems possible that you are running out of memory unrolling a single
> partition of the RDD. This is something that can cause your executor to
> OOM, especially if the cache is close to being full so the executor
> doesn't have much free memory left. How large are your executors? At the
> time of failure, is the cache already nearly full?
>
> I also believe the Snappy compression codec in Hadoop is not splittable.
> This means that each of your JSON files is read in its entirety as one
> Spark partition. If you have files that are larger than the standard block
> size (128MB), it will exacerbate this shortcoming of Spark. Incidentally,
> this means minPartitions won't help you at all here.
>
> This is fixed in the master branch and will be fixed in Spark 1.1. As a
> debugging step (if this is doable), it's worth running this job on the
> master branch and seeing if it succeeds.
>
> https://github.com/apache/spark/pull/1165
>
> A (potential) workaround would be to first persist your data to disk, then
> re-partition it, then cache it. I'm not 100% sure whether that will work
> though.
>
> val a = sc.textFile("s3n://some-path/*.json")
>   .persist(DISK_ONLY)
>   .repartition(/* larger number of partitions */)
>   .cache()
>
> - Patrick
>
> On Fri, Aug 1, 2014 at 10:17 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Isn't this your worker running out of its memory for computations,
>>> rather than for caching RDDs?
>>
>> I'm not sure how to interpret the stack trace, but let's say that's true.
>> I'm even seeing this with a simple a = sc.textFile().cache() and then
>> a.count(). Spark shouldn't need that much memory for this kind of work,
>> no?
>>
>>> then the answer is that you should tell
>>> it to use less memory for caching.
>>
>> I can try that. That's done by changing spark.storage.memoryFraction,
>> right?
>>
>> This still seems strange though. The default fraction of the JVM left for
>> non-cache activity (1 - 0.6 = 40%
>> <http://spark.apache.org/docs/latest/configuration.html#execution-behavior>)
>> should be plenty for just counting elements. I'm using m1.xlarge nodes
>> that have 15GB of memory apiece.
>>
>> Nick
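As a footnote on the config Nick mentions above: in Spark 1.x the cache fraction is spark.storage.memoryFraction (default 0.6), and lowering it would look roughly like this sketch. The 0.4 value and app name are just examples, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Leave more of the executor heap for computation (e.g. unrolling partitions)
// by shrinking the fraction reserved for cached RDD blocks (default 0.6).
val conf = new SparkConf()
  .setAppName("smaller-cache-fraction")
  .set("spark.storage.memoryFraction", "0.4")   // example value; tune for your workload

val sc = new SparkContext(conf)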