Hi,

I am using Spark 1.3.1 on EMR with lots of memory.  I have attempted to run
a large pyspark job several times, specifying `spark.shuffle.spill=false`
in different ways.  It seems that the setting is ignored, at least
partially, and some of the tasks start spilling large amounts of data to
disk.  The job has been fast enough in the past, but once it starts
spilling to disk it lands on Miller's planet [1].

Is this expected behavior?  Is it a misconfiguration on my part, e.g.,
could some other setting be overriding `spark.shuffle.spill=false`?  Is it
something specific to Spark 1.3.1, or to EMR?  When I've let the job run
for a while, I've started to see Kryo stack traces in the tasks that are
spilling to disk.  The stack traces complain about there not being enough
disk space, although `df` shows plenty of free space (perhaps because I'm
checking after the fact, once temporary files have been cleaned up).
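
In case it helps, here is roughly how I have been sanity-checking what the
driver actually sees (I realize `sc._conf` is an internal attribute; the
Environment tab of the web UI shows the same information):

    # Quick check from the driver; sc._conf is internal, so this is just a hack.
    print(sc._conf.get("spark.shuffle.spill"))            # expecting 'false'
    print(sc._conf.get("spark.shuffle.manager", "sort"))  # 'sort' is the 1.3 default

    # Dump every shuffle- or memory-related setting the driver sees.
    for key, value in sorted(sc._conf.getAll()):
        if "shuffle" in key or "memory" in key:
            print("%s = %s" % (key, value))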

Has anyone run into something like this before?  I would actually be happy
to see OOM errors, since that would at least be consistent with the shuffle
data being kept in memory as requested, but I haven't seen any yet.

Eric


[1] https://www.youtube.com/watch?v=v7OVqXm7_Pk&safe=active
