You can bump up the number of partitions by passing the numPartitions argument to the join operator. However, you also have a data skew problem, which you need to resolve with a reasonable partitioning function (see the sketches below the quoted message).

On 7 Jul 2015 08:57, "Mohammed Omer" <[email protected]> wrote:
> Afternoon all,
>
> Really loving this project and the community behind it. Thank you all
> for your hard work.
>
> This past week, though, I've been having a hard time getting my first
> deployed job to run without failing at the same point every time: right
> after a leftOuterJoin, most partitions (600 total) are small (1-100MB),
> while some others are large (3-6GB). The large ones consistently spill
> 20-60GB into memory, and eventually fail.
>
> If I could only get the partitions to be smaller, right out of the
> leftOuterJoin, it seems like the job would run fine.
>
> I've tried trawling through the logs, but it hasn't been very fruitful
> in finding out what, specifically, is the issue.
>
> Cluster setup:
>
> * 6 worker nodes (16 cores, 104GB memory, 500GB storage)
> * 1 master (same config as above)
>
> Running Spark on YARN, with:
>
> storage.memoryFraction = .3
> --executors = 6
> --executor-cores = 12
> --executor-memory = kind of confusing due to YARN, but basically the
> Spark monitor site's Executors page shows each executor running with
> 18.8GB memory, though I know usage is much larger due to YARN managing
> various pieces. (Total memory available to YARN shows 480GB, with 270GB
> currently used.)
>
> Screenshot of the task page: http://i.imgur.com/xG3KdEl.png
>
> Code:
> https://gist.github.com/momer/8bc03c60a639e5c04eda#file-spark-scala-L60
> (see line 60 for the relevant area)
>
> Any pointers in the right direction, or advice on articles to read, or
> even debugging / settings advice or recommendations would be extremely
> helpful. I'll put a bounty on this of a $50 donation to the ASF! :D
>
> Thank you all for reading (and hopefully replying!),
>
> Mo Omer
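
A minimal sketch of the first suggestion, in the spirit of the join at line 60 of the gist (the RDD names and the count of 2000 are illustrative, not taken from the thread): leftOuterJoin accepts either a target partition count or an explicit Partitioner, which determines how many partitions come out of the shuffle.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MorePartitions {
  def main(args: Array[String]): Unit = {
    // Local master only so the sketch runs standalone; on the cluster
    // the master comes from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("more-partitions").setMaster("local[*]"))

    // Illustrative stand-ins for the two pair RDDs joined in the gist.
    val users  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val events = sc.parallelize(Seq(("a", "x")))

    // Passing a partition count sets the number of shuffle output
    // partitions for the join directly...
    val byCount = users.leftOuterJoin(events, 2000)

    // ...or an explicit Partitioner can be supplied instead.
    val byPartitioner = users.leftOuterJoin(events, new HashPartitioner(2000))

    println(byCount.partitions.length)        // 2000
    println(byPartitioner.partitions.length)  // 2000
    sc.stop()
  }
}

Note that a higher partition count alone will not cure the skew: every record sharing one hot key still hashes to the same partition, which is why the skew itself needs separate treatment.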

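One common way to treat the skew itself is key salting. This is a generic technique rather than anything from the gist, and every name below is made up: each left-side record gets a random sub-key, the right side is replicated once per sub-key so matches still line up, and the salt is dropped after the join. No single partition then has to hold an entire hot key.

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SaltedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("salted-join").setMaster("local[*]"))

    // Toy data: "hot" stands in for a key that dominates a partition.
    val left  = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))
    val right = sc.parallelize(Seq(("hot", "x"), ("cold", "y")))

    val salt = 32  // number of sub-keys each key is spread across

    // Scatter each left record across the sub-keys at random...
    val saltedLeft = left.map { case (k, v) => ((k, Random.nextInt(salt)), v) }

    // ...and replicate each right record once per sub-key so the join
    // still finds every match.
    val saltedRight = right.flatMap { case (k, w) =>
      (0 until salt).map(i => ((k, i), w))
    }

    // Join on the salted key, then strip the salt back off.
    val joined = saltedLeft.leftOuterJoin(saltedRight)
      .map { case ((k, _), pair) => (k, pair) }

    joined.collect().foreach(println)
    sc.stop()
  }
}

The trade-off is that the right side is duplicated once per sub-key (32 times here), so this pays off when the right RDD is much smaller than the skewed left one.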