I'm not caching the data. By "each iteration" I mean each 128 MB block
that an executor has to process.

The code is pretty simple.

final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
    @Override
    public String call(byte[] arg0) throws Exception {
        return c.parse(arg0).toString();
    }
});
rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations, e.g. bytes [0..3] become an integer, [4..20] a String,
[21..27] another String, and so on.
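For what it's worth, a minimal sketch of that kind of fixed-offset parsing (the field meanings and widths here are made up, since the actual Conversor.parse isn't shown) could look like:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ParseSketch {

    // Hypothetical fixed-width record: an int in bytes [0..3],
    // a 17-byte String in [4..20], and a 7-byte String in [21..27].
    static String parse(byte[] record) {
        int id = ByteBuffer.wrap(record, 0, 4).getInt();
        String a = new String(record, 4, 17, StandardCharsets.US_ASCII).trim();
        String b = new String(record, 21, 7, StandardCharsets.US_ASCII).trim();
        return id + "," + a + "," + b;
    }

    public static void main(String[] args) {
        // Build a 28-byte record by hand and parse it back.
        byte[] rec = new byte[28];
        ByteBuffer.wrap(rec).putInt(42);
        byte[] name = "ALICE".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(name, 0, rec, 4, name.length);
        byte[] city = "MADRID".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(city, 0, rec, 21, city.length);
        System.out.println(parse(rec)); // 42,ALICE,MADRID
    }
}
```

Note that each new String(...) allocation per field per record is exactly the kind of short-lived garbage that can add GC pressure at high parallelism.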

It's just test code; I'd like to understand what's happening.

2015-02-04 18:57 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
> Hi Guillermo,
>
> What exactly do you mean by "each iteration"?  Are you caching data in
> memory?
>
> -Sandy
>
> On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <konstt2...@gmail.com>
> wrote:
>>
>> I execute a job in Spark where I'm processing a file of 80Gb in HDFS.
>> I have 5 slaves:
>> (32cores /256Gb / 7physical disks) x 5
>>
>> I have been trying many different configurations with YARN.
>> yarn.nodemanager.resource.memory-mb 196Gb
>> yarn.nodemanager.resource.cpu-vcores 24
>>
>> I have tried to execute the job with different numbers of executors and
>> different memory settings (1-4g).
>> With 20 executors each iteration (128 MB) takes 25s, and it never
>> spends a really long time waiting on GC.
>>
>> When I execute around 60 executors the processing time is about 45s, and
>> some tasks take up to a minute because of GC.
>>
>> I have no idea why GC kicks in when I execute more executors
>> simultaneously.
>> The other question is why it takes more time to execute each
>> block. My theory is that it's because there are only 7 physical
>> disks, and 20 processes writing is not the same as 5.
>>
>> The code is pretty simple: it's just a map function that parses a line
>> and writes the output to HDFS. There are a lot of substrings inside
>> the function, which could cause GC.
>>
>> Any theories?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
