Thanks, Bobby. I will try a few more times.

Are there any finer-grained profiling tools for each task? For example, CPU utilization and disk and network I/O per task.
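So far I only know how to read the built-in per-task counters for a finished job, either with "hadoop job -history <job output dir>" or through the old JobClient API, roughly like the sketch below (written against the hadoop-1.2.1 mapred classes; exactly which counters are available depends on the version, and the job id is just an example). These are only totals per task, though, so something that samples real CPU, disk and network usage would help.

// Rough sketch: dump the built-in counters for every map and reduce task
// of a finished job, using the old hadoop-1.2.1 mapred client API.
// (The job must still be known to the JobTracker / job history.)
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class TaskCounterDump {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    JobID jobId = JobID.forName(args[0]);   // e.g. job_201309090000_0001

    for (TaskReport r : client.getMapTaskReports(jobId)) {
      System.out.println("map " + r.getTaskID());
      dump(r.getCounters());
    }
    for (TaskReport r : client.getReduceTaskReports(jobId)) {
      System.out.println("reduce " + r.getTaskID());
      dump(r.getCounters());
    }
  }

  // Prints every counter group for a task; CPU time, file/HDFS bytes and
  // shuffle bytes show up here, depending on the Hadoop version.
  private static void dump(Counters counters) {
    for (Counters.Group group : counters) {
      for (Counters.Counter counter : group) {
        System.out.println("  " + group.getDisplayName() + "."
            + counter.getDisplayName() + " = " + counter.getCounter());
      }
    }
  }
}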
2013/9/9 Robert Evans <ev...@yahoo-inc.com>

> How many times did you run the experiment at each setting? What is the
> standard deviation for each of these settings? It could be that you are
> simply running into the error bounds of Hadoop. Hadoop is far from
> consistent in its performance. For our benchmarking we typically run the
> test 5 times, throw out the top and bottom results as possible outliers,
> and then average the other runs. Even with that we have to be very
> careful to weed out bad nodes or the numbers are useless for comparison.
> The other thing to look at is where all of the time was spent for each of
> these settings. The map portion should be very close to linear with the
> number of tasks, assuming that there is no disk or network contention.
> The shuffle is far from linear, as the number of fetches is a function of
> the number of maps and the number of reducers. The reduce phase itself
> should be close to linear, assuming that there isn't much skew in your
> data.
>
> --Bobby
>
> On 9/7/13 3:33 AM, "牛兆捷" <nzjem...@gmail.com> wrote:
>
> >But I still want to find the most efficient assignment and scale both
> >data and nodes as you said. For example, in my result, 2 is the best,
> >and 8 is better than 4.
> >
> >Why is it sub-linear from 2 to 4 and super-linear from 4 to 8? I find it
> >hard to model this result. Can you give me some hint about this kind of
> >trend?
> >
> >
> >2013/9/7 Vinod Kumar Vavilapalli <vino...@hortonworks.com>
> >
> >>
> >> Clearly your input size isn't changing. And depending on how they are
> >> distributed on the nodes, there could be DataNode/disk contention.
> >>
> >> The better way to model this is by also scaling the input data
> >> linearly. More nodes should process more data in the same amount of
> >> time.
> >>
> >> Thanks,
> >> +Vinod
> >>
> >> On Sep 6, 2013, at 8:27 AM, 牛兆捷 wrote:
> >>
> >> > Hi all:
> >> >
> >> > I vary the number of computational nodes in the cluster and get the
> >> > speedup result in the attachment.
> >> >
> >> > In my mind, there are three types of speedup model: linear,
> >> > sub-linear and super-linear. However, the curve of my result seems a
> >> > little strange. I have attached it.
> >> > <speedup.png>
> >> >
> >> > This is the sort in example.jar; it actually uses only the default
> >> > map-reduce mechanism of Hadoop.
> >> >
> >> > I use hadoop-1.2.1, with 8 map slots and 8 reduce slots per node
> >> > (12 CPUs, 20 GB memory);
> >> > io.sort.mb = 512, block size = 512 MB, heap size = 1024 MB,
> >> > reduce.slowstart = 0.05; the others are default.
> >> >
> >> > Input data: 20 GB, which I divide into 64 files.
> >> >
> >> > Sort example: 64 map tasks, 64 reduce tasks.
> >> >
> >> > Computational nodes: varying from 2 to 9.
> >> >
> >> > Why is the speedup like this? How can I model it properly?
> >> >
> >> > Thanks~
> >> >
> >> > --
> >> > Sincerely,
> >> > Zhaojie
> >> >
> >>
> >
> >--
> >Sincerely,
> >Zhaojie

--
Sincerely,
Zhaojie
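P.S. For the repeated runs I will aggregate them the way Bobby described: run each setting five times, drop the fastest and slowest run as possible outliers, and average the rest. A small sketch of that (the run times below are made-up numbers, just to show the calculation):

// Sketch: aggregate repeated benchmark runs as Bobby described -- sort the
// run times, drop the fastest and slowest as possible outliers, average the rest.
import java.util.Arrays;

public class TrimmedMean {
  static double trimmedMean(double[] runTimesSec) {
    double[] sorted = runTimesSec.clone();
    Arrays.sort(sorted);
    double sum = 0;
    for (int i = 1; i < sorted.length - 1; i++) {   // skip min and max
      sum += sorted[i];
    }
    return sum / (sorted.length - 2);               // needs at least 3 runs
  }

  public static void main(String[] args) {
    // made-up wall-clock times (seconds) for 5 runs of one setting
    double[] runs = {612, 598, 731, 605, 589};
    System.out.println("trimmed mean = " + trimmedMean(runs) + " s");
  }
}

I will also break the time down per phase, since with 64 maps and 64 reduces the shuffle always does 64 * 64 = 4096 fetches no matter how many nodes serve them, which may explain part of the non-linearity.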