Is your RDD made of Strings? If so, make sure to use the Kryo serializer instead of the default Java one. It stores strings as UTF-8 rather than Java's default UTF-16 representation, which can cut string memory usage roughly in half in the right situation. A small sketch of the configuration follows.
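Something like this (just a sketch, assuming a Spark 1.x SparkConf setup; the app name is illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  // Switch the serializer to Kryo so cached/shuffled records are written compactly
  // (strings end up as UTF-8) instead of going through default Java serialization.
  val conf = new SparkConf()
    .setAppName("sort-example")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)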
Try setting the persistence level on the RDD to MEMORY_AND_DISK_SER, and possibly also lowering spark.storage.memoryFraction from 0.6 to 0.4 or so. There is a rough sketch of both changes below the quoted message.

Andrew

On Thu, May 15, 2014 at 2:55 PM, Gustavo Enrique Salazar Torres <gsala...@ime.usp.br> wrote:
> Hi there:
>
> I have this dataset (about 12 GB) which I need to sort by key.
> I used the sortByKey method, but when I try to save the file to disk (HDFS
> in this case) it seems that some tasks run out of time because they have
> too much data to save and it can't fit in memory.
> I say this because, before the TimeOut exception at the worker, there is an
> OOM exception from a specific task.
> My question is: is this a common problem in Spark? Has anyone been through
> this issue?
> The cause of the problem seems to be an unbalanced distribution of data
> between tasks.
>
> I will appreciate any help.
>
> Thanks
> Gustavo
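Sketch of the two suggestions above (Spark 1.x APIs; the paths, key extraction, and the 0.4 value are illustrative, not tuned for your data):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // pair-RDD implicits for sortByKey in Spark 1.x
  import org.apache.spark.storage.StorageLevel

  // Give less of the executor heap to the storage cache so tasks keep more working memory.
  val conf = new SparkConf()
    .setAppName("sort-example")
    .set("spark.storage.memoryFraction", "0.4")
  val sc = new SparkContext(conf)

  // Hypothetical input: key each line on its first tab-separated field.
  val pairs = sc.textFile("hdfs:///input/data")
    .map(line => (line.split("\t")(0), line))

  // Persist serialized, spilling partitions that don't fit in memory to local disk
  // instead of failing with OOM.
  pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)

  pairs.sortByKey().saveAsTextFile("hdfs:///output/sorted")

MEMORY_AND_DISK_SER keeps cached partitions in serialized form (smaller, and spillable to disk), which trades some CPU for a much lower risk of the OOMs you are seeing on the heavy tasks.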