Hello,

Thank you for the script! I ran it and got the following total execution times for cat-ing the file:
major  minor  fs_in  fs_out  wall    user  sys   ctx_invol  ctx_vol
2      17049  0      0       1.23    1.60  0.11  18         1113
2      17170  0      0       1.22    1.61  0.10  22         1023
2      17326  0      0       1.22    1.61  0.10  33         1049
2      17222  0      0       1.22    1.61  0.11  23         1020
2      18047  0      0       1.22    1.62  0.09  18         1033
2      18259  0      0       *1.27*  1.61  0.11  23         1068
2      17555  0      0       1.22    1.62  0.09  35         1018
2      17633  0      0       1.22    1.61  0.10  21         1036
2      17459  0      0       1.22    1.61  0.10  32         1059
2      18040  0      0       1.22    1.61  0.10  32         1043

With REPS_PER_RUN=50 and NUM_TRIALS=10, the script cats the file 50 times in each trial. Why not just 2 or so (from the second iteration on, the file is already in the buffer cache)? Also, I looked at the results and found an outlier (*1.27*); I assume that run took longer because of the load on the machine at the time?

I would also like to get further information, such as the CPU time and network bandwidth consumed per node for a command. Do you know whether Cloudera adds hook points to CDH3 to measure these? Are there any other benchmarking scripts?

Thanks,
Keren

On Mon, Jul 18, 2011 at 7:55 AM, Todd Lipcon <t...@cloudera.com> wrote:

> For benchmarking CPU, I start a pseudo-distributed HDFS cluster, put a
> smallish file on the local datanode (such that it fits in buffer cache), and
> then use the following script with various parameters to look at CPU usage
> to cat the file. For example:
>
> $ REPS_PER_RUN=50 NUM_TRIALS=10 ./read-benchmark.sh hdfs://localhost/128M-file /tmp/benchmark-results.txt
>
> Script:
>
> #!/bin/sh -x
> set -e
> BINDIR=$(dirname $0)
>
> INPUT=$1
> OUTPUT=$2
> NUM_TRIALS=${NUM_TRIALS:-10}
> HADOOP=${HADOOP:-./bin/hadoop}
> HADOOP_FLAGS=${HADOOP_FLAGS:--Dio.file.buffer.size=$[64*1024]}
> REPS_PER_RUN=${REPS_PER_RUN:-1}
>
> HEADER="major\tminor\tfs_in\tfs_out\twall\tuser\tsys\tctx_invol\tctx_vol\n"
> TIME_FORMAT="%F\t%R\t%I\t%O\t%e\t%U\t%S\t%c\t%w"
>
> ! test -f $OUTPUT && printf $HEADER > $OUTPUT
> for x in `seq 1 $NUM_TRIALS` ; do
>   /usr/bin/time --append -o $OUTPUT -f $TIME_FORMAT \
>     $HADOOP fs $HADOOP_FLAGS -cat $(for rep in $(seq 1 $REPS_PER_RUN) ; do echo $INPUT ; done) > /dev/null
> done
>
> On Wed, Jul 6, 2011 at 1:16 AM, Keren Ouaknine <ker...@gmail.com> wrote:
>
> > Hello,
> >
> > I am working on the optimization of task scheduling for Hadoop and would
> > like to benchmark with *Apache Hadoop's standard benchmarks*. So far, I
> > have used my own scripts to measure and monitor. Where can I find the
> > benchmarks you are referring to, please?
> >
> > Thanks,
> > Keren
> >
> > On Wed, Jul 6, 2011 at 7:32 AM, Todd Lipcon (JIRA) <j...@apache.org> wrote:
> >
> > > Simplify BlockReader to not inherit from FSInputChecker
> > > -------------------------------------------------------
> > >
> > >                 Key: HDFS-2129
> > >                 URL: https://issues.apache.org/jira/browse/HDFS-2129
> > >             Project: Hadoop HDFS
> > >          Issue Type: Sub-task
> > >          Components: hdfs client
> > >            Reporter: Todd Lipcon
> > >            Assignee: Todd Lipcon
> > >
> > > BlockReader is currently quite complicated since it has to conform to the
> > > FSInputChecker inheritance structure. It would be much simpler to implement
> > > it standalone. Benchmarking indicates it's slightly faster, as well.
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > For more information on JIRA, see: http://www.atlassian.com/software/jira
> >
> > --
> > Keren Ouaknine
> > Cell: +972 54 2565404
> > Web: www.kereno.com
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Keren Ouaknine
Cell: +972 54 2565404
Web: www.kereno.com
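For what it is worth, below is a minimal sketch of how the per-node CPU and network numbers asked about above could be approximated without any CDH3 hooks. It assumes a Linux node (GNU time, /proc/net/dev) and the pseudo-distributed setup from the thread, where HDFS traffic goes over the loopback interface; the wrapper name resource-wrapper.sh and the IFACE variable are illustrative only, not an existing tool.

#!/bin/sh
# resource-wrapper.sh -- hypothetical helper, not part of CDH3 or of the
# script above. Runs a command and reports its CPU time (via GNU time) plus
# the change in node-wide byte counters for one network interface, read from
# Linux's /proc/net/dev. On a pseudo-distributed cluster the HDFS traffic
# goes over loopback, so IFACE defaults to lo.

IFACE=${IFACE:-lo}

iface_bytes() {
    # Print "rx_bytes tx_bytes" for $IFACE from /proc/net/dev.
    sed -n "s/^ *$IFACE: *//p" /proc/net/dev | awk '{print $1, $9}'
}

before=$(iface_bytes)

# GNU time writes the CPU/wall summary to stderr.
/usr/bin/time -f "user_cpu=%U sys_cpu=%S wall=%e" "$@"

after=$(iface_bytes)

rx0=$(echo $before | awk '{print $1}'); tx0=$(echo $before | awk '{print $2}')
rx1=$(echo $after | awk '{print $1}');  tx1=$(echo $after | awk '{print $2}')

# Report on stderr so that redirecting the command's stdout (e.g. to
# /dev/null) does not swallow the summary.
echo "rx_bytes=$((rx1 - rx0)) tx_bytes=$((tx1 - tx0))" >&2

A hypothetical invocation, reusing the 128M test file from the thread:

$ IFACE=lo ./resource-wrapper.sh ./bin/hadoop fs -cat hdfs://localhost/128M-file > /dev/null

Note that the byte delta covers everything on that interface during the run, not just the wrapped command, and the CPU figures cover only the client-side hadoop process, so this is only a rough stand-in for real per-node instrumentation.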