Hello,

Thank you for the script! I ran it and got the following execution times for cat-ing:

major  minor  fs_in  fs_out  wall    user  sys   ctx_invol  ctx_vol
2      17049  0      0       1.23    1.60  0.11  18         1113
2      17170  0      0       1.22    1.61  0.10  22         1023
2      17326  0      0       1.22    1.61  0.10  33         1049
2      17222  0      0       1.22    1.61  0.11  23         1020
2      18047  0      0       1.22    1.62  0.09  18         1033
2      18259  0      0       *1.27*  1.61  0.11  23         1068
2      17555  0      0       1.22    1.62  0.09  35         1018
2      17633  0      0       1.22    1.61  0.10  21         1036
2      17459  0      0       1.22    1.61  0.10  32         1059
2      18040  0      0       1.22    1.61  0.10  32         1043


Using REPS_PER_RUN=50 and NUM_TRIALS=10, the script cats the file 50 times per trial.
Why not just 2 or so, since from the second iteration onward the file is already in
the buffer cache? Also, I looked at the results and found an outlier (1.27). I assume
the longer execution time is due to the load on the machine at that moment?
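
On the buffer cache point, something along these lines should confirm the warm-cache
effect on a plain local file (a rough sketch only, Linux-specific, needs root for
drop_caches; the local path is just a placeholder):

  # drop the page cache so the first read is cold, then compare with a warm read
  sync
  echo 3 > /proc/sys/vm/drop_caches
  /usr/bin/time -f "cold: wall=%e user=%U sys=%S" cat /path/to/128M-file > /dev/null
  /usr/bin/time -f "warm: wall=%e user=%U sys=%S" cat /path/to/128M-file > /dev/null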

I would like to get further information such as the CPU time and network
bandwidth consumed per node for a command. Do you know if Cloudera adds hook
points to CDH3 to measure these? Are there any other benchmarking scripts?
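
The closest per-node approximation I can think of is sampling the kernel counters
around the command, something like this (a sketch only; eth0 and the paths are
placeholders, and the interface counters of course include any unrelated traffic
on the node):

  IFACE=eth0
  rx0=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
  tx0=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
  /usr/bin/time -f "cpu: user=%U sys=%S wall=%e" \
      ./bin/hadoop fs -cat hdfs://localhost/128M-file > /dev/null
  rx1=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
  tx1=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
  echo "rx bytes: $((rx1 - rx0))  tx bytes: $((tx1 - tx0))"

That only covers the node the command runs on, though, which is why hook points
inside the daemons would be much nicer.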

Thanks,
Keren



On Mon, Jul 18, 2011 at 7:55 AM, Todd Lipcon <t...@cloudera.com> wrote:

> For benchmarking CPU, I start a pseudo-distributed HDFS cluster, put a
> smallish file on the local datanode (such that it fits in buffer cache),
> and
> then use the following script with various parameters to look at CPU usage
> to cat the file. for example:
>
> $ REPS_PER_RUN=50 NUM_TRIALS=10 ./read-benchmark.sh
> hdfs://localhost/128M-file /tmp/benchmark-results.txt
>
> Script:
>
> #!/bin/sh -x
> set -e
> BINDIR=$(dirname $0)
>
> INPUT=$1
> OUTPUT=$2
> NUM_TRIALS=${NUM_TRIALS:-10}
> HADOOP=${HADOOP:-./bin/hadoop}
> HADOOP_FLAGS=${HADOOP_FLAGS:--Dio.file.buffer.size=$((64*1024))}
> REPS_PER_RUN=${REPS_PER_RUN:-1}
>
>
> HEADER="major\tminor\tfs_in\tfs_out\twall\tuser\tsys\tctx_invol\tctx_vol\n"
> TIME_FORMAT="%F\t%R\t%I\t%O\t%e\t%U\t%S\t%c\t%w"
>
> ! test -f $OUTPUT && printf $HEADER > $OUTPUT
> for x in `seq 1 $NUM_TRIALS` ; do
>    /usr/bin/time --append -o $OUTPUT -f $TIME_FORMAT \
>        $HADOOP fs $HADOOP_FLAGS -cat $(for rep in $(seq 1 $REPS_PER_RUN) ;
> do echo $INPUT ; done) > /dev/null
> done
>
>
> On Wed, Jul 6, 2011 at 1:16 AM, Keren Ouaknine <ker...@gmail.com> wrote:
>
> > Hello,
> >
> > I am working on the optimization of task scheduling for Hadoop and would
> > like to benchmark with* Apache Hadoop's standards benchmarks*. So far, I
> > used my own scripts to measure and monitor. Where can I find the
> > benchmarking you are referring to please?
> >
> > Thanks,
> > Keren
> >
> > On Wed, Jul 6, 2011 at 7:32 AM, Todd Lipcon (JIRA) <j...@apache.org>
> > wrote:
> >
> > > Simplify BlockReader to not inherit from FSInputChecker
> > > -------------------------------------------------------
> > >
> > >                 Key: HDFS-2129
> > >                 URL: https://issues.apache.org/jira/browse/HDFS-2129
> > >             Project: Hadoop HDFS
> > >          Issue Type: Sub-task
> > >          Components: hdfs client
> > >            Reporter: Todd Lipcon
> > >            Assignee: Todd Lipcon
> > >
> > >
> > > BlockReader is currently quite complicated since it has to conform to
> the
> > > FSInputChecker inheritance structure. It would be much simpler to
> > implement
> > > it standalone. Benchmarking indicates it's slightly faster, as well.
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > For more information on JIRA, see:
> > http://www.atlassian.com/software/jira
> > >
> > >
> > >
> >
> >
> > --
> > Keren Ouaknine
> > Cell: +972 54 2565404
> > Web: www.kereno.com
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Keren Ouaknine
Cell: +972 54 2565404
Web: www.kereno.com
