Hello, I would like to use the PigMix data generator and jobs to create arbitrary datasets and workloads for testing my scheduler. I'm running into some problems with the data generator and would like to know whether I'm doing something wrong and how to use it effectively.
My cluster consists of 4 machines, each running both the HDFS and MapReduce daemons; one of them also acts as the NameNode and JobTracker. I'm using the FIFO scheduler and Hadoop 1.0.4, with an HDFS block size of 64 MB.

To generate the dataset, I invoke the data generator like this (from the pigmix directory):

  ./scripts/generate_data.sh 10 100000 data_100000_10

The script completes, and in the end I obtain 56 files in 6 different directories in HDFS. The problem is that each file is only about 16 MB, small compared to an HDFS block, so the mappers reading those files finish almost instantly.

My first question: is it possible to concatenate all the data generator's output files in each directory to obtain a file of the correct size?

One workaround I'm trying right now is to derive the required output size from the number of files times the block size: if I have 54 files and a 64 MB block size, then I need about 3.5 GB of output for each file to be roughly one block. But then I noticed that the generated files in different directories have different sizes. So my second question: what do I have to do to create files of a size comparable to the HDFS block size?

My last question: is it normal for the data generator to take so long? Generating 1 GB of output took one hour, and I don't understand why.

Thanks for your time,
NN
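P.S. For concreteness, here is the back-of-the-envelope sizing calculation I mentioned above (the 54-file count and 64 MB block size are just the numbers from my run, not anything prescribed by PigMix):

```shell
files=54        # part files I counted in HDFS after one run
block_mb=64     # HDFS block size configured on my cluster
total_mb=$((files * block_mb))
echo "target output: ${total_mb} MB (~3.4 GB)"
```

That gives 3456 MB, which is where my "about 3.5 GB" figure comes from; it assumes the generator splits its output evenly across the part files, which, as noted, does not seem to hold across directories.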
