Hi. I'm running some comparisons between Flink, MRv2, and Spark (1.3) using the new Intel HiBench suite. I've started with the stock WordCount example, and I'm seeing numbers that are not where I expected them to be.
So my question is: what are the configuration parameters that can affect performance? Is there a performance/tuning guide?

Hardware-wise, we have 48 FDR-connected Haswell nodes with 32 physical / 64 HT cores and 128 GB. I'm parsing 2 TB of text, using the following parameters:

    ./bin/flink run -m yarn-cluster \
      -yD fs.overwrite-files=true \
      -yD fs.output.always-create-directory=true \
      -yq \
      -yn $((666)) \
      -yD taskmanager.numberOfTaskSlots=$((1)) \
      -yD parallelization.degree.default=$((666)) \
      -ytm $((4*1024)) \
      -yjm $((4*1024)) \
      ./examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar \
      hdfs:///user/jsparks/HiBench/Wordcount/Input \
      hdfs:///user/jsparks/HiBench/Wordcount/Output

Any pointers would be greatly appreciated.

Type                 Date        Time      Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
HadoopWordcount      2015-06-03  10:45:11  2052360935068    763.106      2689483420           2689483420
JavaSparkWordcount   2015-06-03  10:55:24  2052360935068    411.246      4990591847           4990591847
ScalaSparkWordcount  2015-06-03  11:06:24  2052360935068    342.777      5987452294           5987452294
flinkWordCount       2015-06-04  16:27:27  2052360935068    647.383      3170242244           66046713

--
Jonathan (Bill) Sparks
Software Architecture
Cray Inc.
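
P.S. As a sanity check on the figures above, here is a minimal Python sketch that recomputes throughput as input size divided by duration and normalizes every run by the same node count. `NODES = 48` is an assumption taken from the hardware description. In the reported rows, the Hadoop and Spark entries show Throughput/node equal to total throughput, so they appear to have been divided by 1 rather than 48, which would make that column not directly comparable with the Flink row.

```python
# (type, input size in bytes, duration in seconds), copied from the results above
RUNS = [
    ("HadoopWordcount", 2052360935068, 763.106),
    ("JavaSparkWordcount", 2052360935068, 411.246),
    ("ScalaSparkWordcount", 2052360935068, 342.777),
    ("flinkWordCount", 2052360935068, 647.383),
]

# Assumption: all runs used the same 48 nodes described in the hardware summary.
NODES = 48


def throughput(size_bytes, duration_s):
    """Aggregate throughput in bytes per second."""
    return size_bytes / duration_s


def per_node(size_bytes, duration_s, nodes=NODES):
    """Throughput normalized by the number of nodes."""
    return throughput(size_bytes, duration_s) / nodes


if __name__ == "__main__":
    print(f"{'Type':<20} {'bytes/s':>14} {'bytes/s/node':>14}")
    for name, size, duration in RUNS:
        print(f"{name:<20} {throughput(size, duration):>14.0f} "
              f"{per_node(size, duration):>14.0f}")
```

With a common divisor, the per-node column keeps the same ordering as the aggregate throughput column, so either one can be used to compare the frameworks.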