I understand that I have CompressMapOutput set, and it works: the map outputs
are compressed. But on the reduce end it downloads x files, then merges those x
files into one intermediate file to keep the number of files to a minimum,
<= io.sort.factor.
My problem is that the output from merging the intermediate map output files is
not compressed, so I lose all the benefit of compressing the map output
to save disk space, because the merged map output files are no longer
compressed.
Note there are two different types of intermediate files: the map outputs, and
the files the reduce creates when it merges the map outputs to meet the set
io.sort.factor.
Billy
----- Original Message -----
From: "Chris Douglas" <chrisdo-ZXvpkYn067l8UrSeD/g...@public.gmane.org>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org>
Sent: Tuesday, March 17, 2009 12:33 AM
Subject: Re: intermediate results not getting compressed
> I am running 0.19.1-dev, r744282. I have searched the issues but found
> nothing about the compression.
AFAIK, there are no open issues that prevent intermediate compression
from working. The following might be useful:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression
> Should the intermediate results not be compressed also if the map output
> files are set to be compressed?
These are controlled by separate options:
FileOutputFormat::setCompressOutput enables/disables compression on the
final output
JobConf::setCompressMapOutput enables/disables compression of the
intermediate output
> If not then why do we have the map compression option just to save
> network traffic?
That's part of it. Also to save on disk bandwidth and intermediate
space. -C
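For reference, a minimal sketch of how the two options named above might be set
with the old org.apache.hadoop.mapred API. The class name CompressionSettings and
the codec choices are assumptions for illustration and do not come from this
thread. The sketch needs the Hadoop jars on the classpath, and it only shows the
two switches; it says nothing about whether the reduce-side merge output is
compressed.

```java
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper class; the name and the codec choices are illustrative only.
public class CompressionSettings {

    public static JobConf configure(JobConf conf) {
        // Intermediate (map) output: this is the CompressMapOutput switch.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(DefaultCodec.class);

        // Final job output: a separate switch, independent of the one above.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        return conf;
    }
}
```

Calling CompressionSettings.configure(conf) before submitting the job would turn
on both settings; either pair of calls can be dropped to leave that stage
uncompressed.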