Hi Guillermo,

What exactly do you mean by "each iteration"? Are you caching data in memory?
-Sandy

On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> I am running a Spark job that processes an 80 GB file in HDFS.
> I have 5 slaves:
> (32 cores / 256 GB / 7 physical disks) x 5
>
> I have been trying many different configurations with YARN.
> yarn.nodemanager.resource.memory-mb 196Gb
> yarn.nodemanager.resource.cpu-vcores 24
>
> I have tried running the job with different numbers of executors and
> amounts of memory (1-4g).
> With 20 executors, each iteration (128 MB) takes 25 s, and it never
> spends a really long time waiting on GC.
>
> When I run around 60 executors, the processing time is about 45 s and
> some tasks take up to one minute because of GC.
>
> I have no idea why GC kicks in when I run more executors
> simultaneously.
> The other question is why it takes more time to process each
> block. My theory is that there are only 7 physical disks per node,
> and 5 processes writing is not the same as 20.
>
> The code is pretty simple: it's just a map function that parses a line
> and writes the output to HDFS. There are a lot of substring calls inside
> the function, which could cause GC.
>
> Any theories?
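
For reference, the numbers in the quoted message can be turned into a rough back-of-envelope check of the disk-contention theory. This is only a sketch under assumptions that the thread does not confirm: that "each iteration" means one task per 128 MB HDFS block, that each executor runs one task at a time, and that YARN spreads executors evenly across the 5 nodes. The function names are made up for illustration.

```python
import math

# Figures taken from the quoted message.
FILE_SIZE_GB = 80
BLOCK_SIZE_MB = 128
NODES = 5
DISKS_PER_NODE = 7


def num_blocks(file_gb, block_mb):
    """Number of HDFS blocks (assumed to equal the number of map tasks)."""
    return math.ceil(file_gb * 1024 / block_mb)


def load_per_node(executors, cores_per_executor=1):
    """Concurrent tasks per node and per physical disk.

    Assumes executors are spread evenly over the nodes, which YARN
    does not guarantee.
    """
    tasks_per_node = executors * cores_per_executor / NODES
    return tasks_per_node, tasks_per_node / DISKS_PER_NODE


def waves(executors, cores_per_executor=1):
    """Rounds of tasks needed to cover every block once."""
    slots = executors * cores_per_executor
    return math.ceil(num_blocks(FILE_SIZE_GB, BLOCK_SIZE_MB) / slots)


if __name__ == "__main__":
    for executors in (20, 60):
        per_node, per_disk = load_per_node(executors)
        print(executors, "executors:",
              per_node, "tasks/node,",
              round(per_disk, 2), "tasks/disk,",
              waves(executors), "waves")
```

Under these assumptions, 20 executors put fewer concurrent tasks on a node than it has disks, while 60 executors put more tasks on a node than it has disks. That is consistent with the contention theory but does not prove it, and it says nothing about the GC pauses.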