It's definitely surprising to me that you would be hitting a lot of GC in this scenario. Are you setting --executor-cores and --executor-memory? What are you setting them to?
-Sandy

On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> Any idea why I get a lot of stops because of GC when I use more containers?
>
> 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz <konstt2...@gmail.com>:
> > I'm not caching the data. By "each iteration" I mean each 128 MB
> > that an executor has to process.
> >
> > The code is pretty simple.
> >
> > final Conversor c = new Conversor(null, null, null, longFields, typeFields);
> > SparkConf conf = new SparkConf().setAppName("Simple Application");
> > JavaSparkContext sc = new JavaSparkContext(conf);
> > JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());
> >
> > JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
> >     @Override
> >     public String call(byte[] arg0) throws Exception {
> >         String result = c.parse(arg0).toString();
> >         return result;
> >     }
> > });
> > rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");
> >
> > The parse function just takes an array of bytes and applies some
> > transformations, like
> > [0..3] an integer, [4..20] a String, [21..27] another String, and so on.
> >
> > It's just test code; I'd like to understand what is happening.
> >
> > 2015-02-04 18:57 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
> >> Hi Guillermo,
> >>
> >> What exactly do you mean by "each iteration"? Are you caching data in
> >> memory?
> >>
> >> -Sandy
> >>
> >> On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <konstt2...@gmail.com>
> >> wrote:
> >>>
> >>> I execute a job in Spark where I'm processing an 80 GB file in HDFS.
> >>> I have 5 slaves:
> >>> (32 cores / 256 GB / 7 physical disks) x 5
> >>>
> >>> I have been trying many different configurations with YARN.
> >>> yarn.nodemanager.resource.memory-mb 196Gb
> >>> yarn.nodemanager.resource.cpu-vcores 24
> >>>
> >>> I have tried to execute the job with different numbers of executors and
> >>> amounts of memory (1-4g).
> >>> With 20 executors, each iteration (128 MB) takes 25s and it never spends
> >>> a really long time waiting because of GC.
> >>>
> >>> When I run around 60 executors, the processing time is about 45s and
> >>> some tasks take up to one minute because of GC.
> >>>
> >>> I have no idea why it's calling GC when I run more executors
> >>> simultaneously.
> >>> The other question is why it takes more time to execute each
> >>> block. My theory is that there are only 7 physical disks, and
> >>> 5 processes writing is not the same as 20.
> >>>
> >>> The code is pretty simple; it's just a map function which parses a line
> >>> and writes the output to HDFS. There are a lot of substrings inside
> >>> the function, which could cause GC.
> >>>
> >>> Any theories?
> >>>
> >>
> >
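
For reference, the fixed-width parsing described in the quoted message ([0..3] an integer, then two text fields) could be sketched roughly as below. This is only an illustration, not the poster's Conversor class: the name FixedWidthRecord, the 28-byte record length, the big-endian integer, the ASCII encoding, and the comma-separated output are all assumptions. It decodes each field directly from the byte array rather than building one large String and taking substrings of it. Whether that has any bearing on the GC pauses reported above is not established by the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the parse step described in the thread.
final class FixedWidthRecord {
    // Assumed layout: bytes 0..3 = big-endian int, 4..20 = text, 21..27 = text.
    static final int RECORD_LENGTH = 28;

    static String parse(byte[] record) {
        if (record.length < RECORD_LENGTH) {
            throw new IllegalArgumentException("record too short: " + record.length);
        }
        // ByteBuffer is big-endian by default; read the first four bytes as an int.
        int id = ByteBuffer.wrap(record, 0, 4).getInt();
        // Decode each field straight from its byte range (17 and 7 bytes),
        // trimming the space padding assumed for this sketch.
        String first = new String(record, 4, 17, StandardCharsets.US_ASCII).trim();
        String second = new String(record, 21, 7, StandardCharsets.US_ASCII).trim();
        return new StringBuilder(48)
                .append(id).append(',').append(first).append(',').append(second)
                .toString();
    }

    public static void main(String[] args) {
        // Build one space-padded sample record and parse it.
        byte[] record = new byte[RECORD_LENGTH];
        Arrays.fill(record, (byte) ' ');
        ByteBuffer.wrap(record).putInt(42);
        System.arraycopy("spark".getBytes(StandardCharsets.US_ASCII), 0, record, 4, 5);
        System.arraycopy("yarn".getBytes(StandardCharsets.US_ASCII), 0, record, 21, 4);
        System.out.println(parse(record)); // prints "42,spark,yarn"
    }
}
```

A static method like this could be called from the body of the map function in place of c.parse(arg0).toString(), assuming the real record layout is substituted for the one guessed here.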