Datasets/DataFrames will use direct/raw/off-heap memory in an efficient
columnar fashion. Trying to fit the same amount of data in heap memory
would likely increase your memory requirement and decrease the speed.

So, in short, don't worry about it and increase
spark.yarn.executor.memoryOverhead. You can also put a bound on the
off-heap memory that Spark itself manages via spark.memory.offHeap.size
(which only takes effect when spark.memory.offHeap.enabled is true).
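
For example, something along these lines (just a sketch; the sizes and app
name are placeholders, not recommendations, and note that
spark.memory.offHeap.size only caps the off-heap memory managed by Spark's
own memory manager, not everything YARN counts against the container):

    // Sketch: executor memory, YARN overhead, and an explicit off-heap bound.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("offheap-example")
      .config("spark.executor.memory", "4g")
      // extra room for off-heap allocations counted against the YARN container (MB)
      .config("spark.yarn.executor.memoryOverhead", "2048")
      // optionally bound the off-heap memory Spark's memory manager uses
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", "2g")
      .getOrCreate()

These settings need to be in place before the executors launch, so you can
equally pass them as --conf flags to spark-submit.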

thanks,
rohitk

On Thu, Nov 24, 2016 at 12:23 AM, Koert Kuipers <ko...@tresata.com> wrote:

> We are testing Dataset/DataFrame jobs instead of RDD jobs. One thing we
> keep running into is containers getting killed by YARN. I realize this has
> to do with off-heap memory, and the suggestion is to increase
> spark.yarn.executor.memoryOverhead.
>
> At times our memoryOverhead is as large as the executor memory (say 4G and
> 4G).
>
> Why are Datasets/DataFrames using so much off-heap memory?
>
> We haven't changed spark.memory.offHeap.enabled, which defaults to false.
> Should we enable that to get a better handle on this?
>
