I'm not caching the data. By "each iteration" I mean each 128 MB block that an executor has to process.
The code is pretty simple:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    final Conversor c = new Conversor(null, null, null, longFields, typeFields);
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read the file as fixed-length binary records.
    JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

    // Parse each record into its text representation.
    JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
        @Override
        public String call(byte[] arg0) throws Exception {
            return c.parse(arg0).toString();
        }
    });

    rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations: bytes [0..3] are an integer, [4..20] a String, [21..27]
another String, and so on.

It's just test code; I'd like to understand what is happening.

2015-02-04 18:57 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
> Hi Guillermo,
>
> What exactly do you mean by "each iteration"? Are you caching data in
> memory?
>
> -Sandy
>
> On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <konstt2...@gmail.com>
> wrote:
>>
>> I'm running a job in Spark that processes an 80 GB file in HDFS.
>> I have 5 slaves:
>> (32 cores / 256 GB / 7 physical disks) x 5
>>
>> I have been trying many different configurations with YARN.
>> yarn.nodemanager.resource.memory-mb 196 GB
>> yarn.nodemanager.resource.cpu-vcores 24
>>
>> I have tried running the job with different numbers of executors and
>> different amounts of memory (1-4g).
>> With 20 executors, each iteration (128 MB) takes 25s and it never
>> spends a really long time waiting on GC.
>>
>> When I run around 60 executors, the processing time is about 45s and
>> some tasks take up to one minute because of GC.
>>
>> I have no idea why it's calling GC when I run more executors
>> simultaneously.
>> The other question is why it takes more time to process each
>> block. My theory is that there are only 7 physical disks, and 5
>> processes writing is not the same as 20.
>>
>> The code is pretty simple: it's just a map function which parses a line
>> and writes the output to HDFS. There are a lot of substrings inside
>> the function, which could cause GC.
>>
>> Any theories?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
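
For reference, the kind of fixed-offset parsing described above could be sketched roughly as follows. This is not the real Conversor code: the class name RecordSketch, the 28-byte record length, the big-endian integer encoding, and the space-padded ASCII strings are all assumptions made for illustration. Only the offsets [0..3], [4..20] and [21..27] come from the message.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Conversor.parse; only the byte offsets come from the message.
final class RecordSketch {
    // Bytes 0..27 inclusive (assumed to be the whole record for this sketch).
    static final int RECORD_LENGTH = 28;

    static String parse(byte[] record) {
        if (record.length < RECORD_LENGTH) {
            throw new IllegalArgumentException(
                "record shorter than " + RECORD_LENGTH + " bytes");
        }
        // Bytes 0..3: assumed to be a big-endian 32-bit integer.
        int number = ByteBuffer.wrap(record, 0, 4).getInt();
        // Bytes 4..20 (17 bytes) and 21..27 (7 bytes): assumed space-padded ASCII.
        String first = new String(record, 4, 17, StandardCharsets.US_ASCII).trim();
        String second = new String(record, 21, 7, StandardCharsets.US_ASCII).trim();
        // Build the output line with a single StringBuilder per record.
        return new StringBuilder(64)
            .append(number).append(',')
            .append(first).append(',')
            .append(second)
            .toString();
    }
}
```

Each field is decoded straight from the byte array instead of converting the whole record to a String and taking substrings, which creates fewer temporary objects per record. Whether that has any effect on the GC pauses seen with 60 executors is untested.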