My problem is that the output from merging the intermediate map output
files is not compressed, so I lose all the benefit of compressing the
map output to save disk space.
It should still be compressed, unless there's some bizarre regression.
More segments will be around simultaneously (since the segments not
yet merged are still on disk), which clearly puts pressure on
intermediate storage, but if the map outputs are compressed, then the
merged map outputs at the reduce must also be compressed. There's no
place in the intermediate format to store compression metadata, so
either all are or none are. Intermediate merges should also follow the
compression spec of the initiating merger (o.a.h.mapred.Merger:447).
How are you concluding that the intermediate output is compressed from
the map, but not in the reduce? -C
----- Original Message -----
From: "Chris Douglas" <chrisdo-ZXvpkYn067l8UrSeD/g...@public.gmane.org>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org>
Sent: Tuesday, March 17, 2009 12:33 AM
Subject: Re: intermediate results not getting compressed
I am running 0.19.1-dev, r744282. I have searched the issues but
found nothing about the compression.
AFAIK, there are no open issues that prevent intermediate
compression from working. The following might be useful:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression
Should the intermediate results not be compressed also if the map
output files are set to be compressed?
These are controlled by separate options.
FileOutputFormat::setCompressOutput enables/disables compression
on the final output
JobConf::setCompressMapOutput enables/disables compression of the
intermediate output
If not, then why do we have the map compression option? Just to save
network traffic?
That's part of it. Also to save on disk bandwidth and intermediate
space. -C
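
For reference, the two switches discussed above can also be set as job
configuration properties rather than through the API. The fragment below is
a sketch only, assuming the property names used by the 0.19-era defaults
(mapred.compress.map.output for the intermediate output, mapred.output.compress
for the final output). The codec shown is an illustrative choice, not something
stated in this thread.

```xml
<!-- Sketch only: property names assumed from the 0.19-era defaults. -->
<configuration>
  <!-- Intermediate (map) output: what JobConf::setCompressMapOutput toggles -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <!-- Codec for the intermediate output (illustrative choice) -->
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <!-- Final job output: what FileOutputFormat::setCompressOutput toggles -->
  <property>
    <name>mapred.output.compress</name>
    <value>false</value>
  </property>
</configuration>
```

With these values only the intermediate data is compressed, and the final
output is written uncompressed, which shows that the two settings are
independent of each other.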