Hello,

I would like to use the PigMix data generator and jobs to create arbitrary
datasets and workloads for testing my scheduler. I'm running into some
problems with the data generator and would like to know whether I'm doing
something wrong and how to use it effectively.

My cluster is composed of 4 machines, each running both HDFS and MapReduce
daemons; one of them also acts as the NameNode and JobTracker. I'm using
the FIFO scheduler and Hadoop 1.0.4. The HDFS block size is 64 MB.

To generate the dataset, I use the data generator in this way (from the
pigmix directory):

./scripts/generate_data.sh 10 100000 data_100000_10

The script runs and in the end I obtain 56 files in 6 different directories
in HDFS. The problem is that each file is roughly 16 MB, small compared to
an HDFS block, so the mappers reading those files finish very quickly. My
first question is: is it possible to concatenate all the output files of
the data generator in each directory to obtain files of the correct size?
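One common approach (a sketch only — the HDFS paths and local file name below are placeholders, not your actual layout) is `hadoop fs -getmerge`, which concatenates every file in an HDFS directory into a single local file; you can then put the merged file back as one large file:

```shell
# Concatenate all part files under one generator output directory
# into a single local file (path names are examples only).
hadoop fs -getmerge /user/nn/data_100000_10/<dir> merged_part

# Upload the merged file back to HDFS as a single large file.
hadoop fs -put merged_part /user/nn/data_100000_10/<dir>_merged

# Then remove the small originals once you've verified the merge.
```

Note that getmerge stages the data on the local disk, so you need enough local space for the whole directory; if that's a concern, something like `hadoop fs -cat <dir>/* | hadoop fs -put - <dest>` should stream the concatenation without a full local copy.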

One solution I'm trying right now is to deduce the required output size
from the number of files times the block size. If I have 54 files and a
block size of 64 MB, then I need roughly 3.4 GB of output (54 × 64 MB) for
each file to be about one block. But then I saw that the generated files in
different directories have different sizes. My second question is: what do
I have to do to create files of a size comparable to the HDFS block size?
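As a sanity check on that arithmetic (the 54-file and 64 MB figures are the ones above; note this is only a rough target, since, as observed, files in different directories come out with different sizes):

```python
# Rough sizing: for each generated file to fill about one HDFS block,
# the total dataset should be roughly num_files * block_size.
def target_dataset_mb(num_files, block_size_mb=64):
    return num_files * block_size_mb

total_mb = target_dataset_mb(54)   # 54 files observed in HDFS
print(total_mb)                    # 3456 MB
print(total_mb / 1024.0)           # 3.375, i.e. about 3.4 GB
```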

My last question is: is it normal for the data generator to take this
long? Generating 1 GB of output took one hour, and I don't understand why.

Thanks for your time,
NN
