I am running a large streaming job that processes about 3 TB of data, and I am seeing large jumps in hard drive usage during the reduce phase. I have tracked the problem down: the job is set to compress map outputs, but looking at the intermediate files on the local drives, they are not being compressed during or after merges. I go from having, say, 2 GB of mapfile.out files to a single intermediate.X file that is 100-350% larger than the map files. I looked at one of these files and confirmed that it is not compressed, since I can read the data in it. If there were only one merge this would not be a problem, but when merging 70-100 of these files it uses many gigabytes, and my tasks are starting to die as they run out of hard drive space, which in the end kills the job.
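
The manual check described above (opening an intermediate file and finding readable data) can be approximated with a small script. This is only a minimal sketch and not part of Hadoop: the function names `printable_fraction` and `looks_uncompressed`, the 64 KB sample size, and the 0.9 threshold are all assumptions chosen for illustration. It relies on the heuristic that text records from a streaming job are mostly printable ASCII, while compressed data is not.

```python
import string

# Bytes treated as "readable": printable ASCII, including common whitespace.
_PRINTABLE = frozenset(string.printable.encode("ascii"))


def printable_fraction(data: bytes) -> float:
    """Return the share of bytes in `data` that are printable ASCII."""
    if not data:
        return 0.0
    return sum(b in _PRINTABLE for b in data) / len(data)


def looks_uncompressed(path: str,
                       sample_size: int = 64 * 1024,
                       threshold: float = 0.9) -> bool:
    """Heuristic: True if the start of the file is mostly readable text.

    `sample_size` and `threshold` are illustrative assumptions, not
    values taken from Hadoop.
    """
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return printable_fraction(sample) >= threshold
```

The heuristic only makes sense for text-oriented data such as streaming records. Binary keys or values would look non-printable even when no compression is applied, so a `False` result does not prove that a file is compressed.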

I am running 0.19.1-dev, r744282. I have searched the issues but found nothing about this compression problem. Shouldn't the intermediate results also be compressed if the map output files are set to be compressed? If not, is the map compression option there only to save network traffic?

