Hi,
I'm running Spark on YARN, carrying out a simple reduceByKey followed by
another reduceByKey after some transformations. After the first stage
completes, my master runs out of memory.
I have 20G assigned to the master, 145 executors (12G each + 4G overhead),
around 90k input files, 10+ TB of data, 2000 reducers, and no caching.
Below are the two reduceByKey calls:
val myrdd = field1And2.map(x => ( x,1)).reduceByKey(_+_, 2000)
The second one feeds off the first one:
val countHistogram = myrdd.map(x => (x._2,1)).reduceByKey(_+_, 2000)
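To make the intent of the two calls concrete, here is a minimal local sketch
on plain Scala collections rather than Spark RDDs. The object name, the
sumByKey helper, and the sample data are made up for illustration and are not
part of my actual job; sumByKey only stands in for reduceByKey(_+_).

```scala
// Local sketch using plain Scala collections (no Spark); names and data are hypothetical.
object CountHistogramSketch {
  // Stand-in for reduceByKey(_ + _): sum the values for each key.
  def sumByKey[K](pairs: Seq[(K, Int)]): Map[K, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Step 1: how many times each (field1, field2) pair occurs.
  def pairCounts(records: Seq[(String, String)]): Map[(String, String), Int] =
    sumByKey(records.map(x => (x, 1)))

  // Step 2: how many distinct pairs occur exactly n times.
  def countHistogram(counts: Map[(String, String), Int]): Map[Int, Int] =
    sumByKey(counts.toSeq.map(x => (x._2, 1)))

  def main(args: Array[String]): Unit = {
    // Made-up sample: two pairs occur twice, one pair occurs once.
    val sample = Seq(("a", "x"), ("a", "x"), ("b", "y"), ("c", "z"), ("c", "z"))
    println(countHistogram(pairCounts(sample)))
  }
}
```

For that sample the histogram maps count 2 to 2 pairs and count 1 to 1 pair,
so the final result should be tiny compared to the input.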
Any idea what the master is doing that consumes so much data and fills up
its memory? There's no collect-style call that would bring the data back to
the master.
Thanks,
Vipul