How many times did you run the experiment at each setting? What is the standard deviation for each of those settings? It could be that you are simply running into the error bounds of Hadoop; Hadoop is far from consistent in its performance. For our benchmarking we will typically run the test 5 times, throw out the top and bottom results as possible outliers, and then average the remaining runs. Even with that we have to be very careful to weed out bad nodes, or the numbers are useless for comparison.

The other thing to look at is where all of the time was spent at each of these settings. The map portion should be very close to linear with the number of tasks, assuming there is no disk or network contention. The shuffle is far from linear, because the number of fetches is a function of the number of maps times the number of reducers. The reduce phase itself should be close to linear, assuming there isn't much skew in your data.
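As a rough illustration of the approach described above (a sketch in plain Python, not something from the thread; the run times below are made up and the helper names are just for illustration), computing the trimmed average and the shuffle fetch count looks like this:

    # Run the job 5 times, drop the fastest and slowest runs as possible
    # outliers, and average the rest. Timings below are hypothetical.
    from statistics import mean, stdev

    def trimmed_average(runtimes):
        """Drop the min and max runtimes, then average the remaining runs."""
        if len(runtimes) < 3:
            raise ValueError("need at least 3 runs to trim top and bottom")
        trimmed = sorted(runtimes)[1:-1]
        return mean(trimmed)

    def shuffle_fetches(num_maps, num_reducers):
        """Every reducer fetches output from every map, so the number of
        fetches grows with maps * reducers, not linearly with nodes."""
        return num_maps * num_reducers

    # Hypothetical wall-clock times (seconds) for 5 runs at one cluster size.
    runs = [612.0, 598.5, 640.2, 601.1, 755.9]
    print("trimmed average: %.1f s" % trimmed_average(runs))
    print("stddev over all runs: %.1f s" % stdev(runs))
    # 64 maps x 64 reducers => 4096 shuffle fetches regardless of node count.
    print("shuffle fetches:", shuffle_fetches(64, 64))

If the standard deviation across runs is on the order of the differences you are seeing between settings, the "trend" may just be noise.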
--Bobby

On 9/7/13 3:33 AM, "牛兆捷" <nzjem...@gmail.com> wrote:

>But I still want to find the most efficient assignment and scale both data
>and nodes as you said. For example, in my result 2 is the best, and 8 is
>better than 4.
>
>Why is it sub-linear from 2 to 4 and super-linear from 4 to 8? I find it
>hard to model this result. Can you give me some hint about this kind of
>trend?
>
>
>2013/9/7 Vinod Kumar Vavilapalli <vino...@hortonworks.com>
>
>> Clearly your input size isn't changing. And depending on how the data are
>> distributed on the nodes, there could be DataNode/disk contention.
>>
>> The better way to model this is by scaling the input data linearly as
>> well. More nodes should process more data in the same amount of time.
>>
>> Thanks,
>> +Vinod
>>
>> On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:
>>
>> > Hi all:
>> >
>> > I vary the number of computational nodes in the cluster and get the
>> > speedup result in the attachment.
>> >
>> > In my mind, there are three types of speedup model: linear, sub-linear,
>> > and super-linear. However, the curve of my result seems a little
>> > strange. I have attached it.
>> > <speedup.png>
>> >
>> > This is the sort in example.jar; it uses only the default map-reduce
>> > mechanism of Hadoop.
>> >
>> > I use hadoop-1.2.1 with 8 map slots and 8 reduce slots per node (12
>> > CPUs, 20g mem).
>> > io.sort.mb = 512, block size = 512mb, heap size = 1024mb,
>> > reduce.slowstart = 0.05; the others are default.
>> >
>> > Input data: 20g, divided into 64 files
>> >
>> > Sort example: 64 map tasks, 64 reduce tasks
>> >
>> > Computational nodes: varying from 2 to 9
>> >
>> > Why is the speedup like this? How can I model it properly?
>> >
>> > Thanks~
>> >
>> > --
>> > Sincerely,
>> > Zhaojie
>
>
>--
>Sincerely,
>Zhaojie