On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> I am running the latest version of PySpark branch-0.9 and having some
> trouble with join.
>
> One RDD is about 100G (25GB compressed and serialized in memory) with
> 130K records, the other RDD is about 10G (2.5G compressed and
> serialized in memory) with 330K records.  I load each RDD from HDFS,
> invoke keyBy to key each record, and then attempt to join the RDDs.
>
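
For reference, a minimal PySpark sketch of the kind of pipeline described
above (the HDFS paths and the key function here are hypothetical stand-ins,
not taken from the original job):

    from pyspark import SparkContext

    sc = SparkContext(appName="join-repro")

    # Load both datasets from HDFS and cache them (PySpark caches records
    # as serialized, pickled objects).
    big = sc.textFile("hdfs:///data/big").cache()
    small = sc.textFile("hdfs:///data/small").cache()

    # Key each record, then join; the join shuffles both RDDs by key.
    big_keyed = big.keyBy(lambda line: line.split("\t")[0])
    small_keyed = small.keyBy(lambda line: line.split("\t")[0])

    joined = big_keyed.join(small_keyed)
    print(joined.count())
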
> The join consistently crashes at the beginning of the reduce phase.
> Note that when joining the 10G RDD to itself there is no problem.
>
> Prior to the crash, several suspicious things happen:
>
> -All map output files from the map phase of the join are written to
> spark.local.dir, even though there should be plenty of memory left to
> contain the map output.  I am reasonably sure *all* map outputs are
> written to disk because the size of the map output spill directory
> matches the size of the shuffle write (as displayed in the user
> interface) for each machine.
>

The shuffle data is written through the operating system's buffer cache, so
you would expect the files to show up in spark.local.dir immediately and to
report their full size when you run "ls". In reality, though, that data is
likely still resident in the OS page cache rather than on disk.


> -In the beginning of the reduce phase of the join, memory consumption
> on each worker spikes and each machine runs out of memory (as evidenced
> by a "Cannot allocate memory" exception in Java).  This is
> particularly surprising since each machine has 30G of RAM and each
> Spark worker has only been allowed 10G.
>

Could you paste the error here?


> -In the web UI both the "Shuffle Spill (Memory)" and "Shuffle Spill
> (Disk)" fields for each machine remain at 0.0 despite shuffle files
> being written into spark.local.dir.
>

Shuffle spill is different from the shuffle files written to
spark.local.dir. Shuffle spilling only happens for aggregations that occur
on the reduce side of the shuffle, so a join like this might not see any
spilling.
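
To make the distinction concrete, here is a small, self-contained sketch
(the RDD and variable names are purely illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="spill-example")
    pairs = sc.parallelize([(i % 1000, i) for i in range(10 ** 6)])

    # A join shuffles data, but with no reduce-side aggregation the
    # "Shuffle Spill (Memory)" / "Shuffle Spill (Disk)" columns in the UI
    # can legitimately stay at 0.0.
    pairs.join(pairs).count()

    # A reduce-side aggregation such as reduceByKey builds per-key state
    # while reading shuffle data; if that state exceeds the in-memory
    # limit, it can be spilled to disk and reported in those counters.
    pairs.reduceByKey(lambda a, b: a + b).count()
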



From the logs it looks like your executor has died. Would you be able to
paste the log from the executor with the exact failure? It would show up in
the /work directory inside of Spark's directory on the cluster.
