Re: intermediate results not getting compressed

Billy Pearson Thu, 19 Mar 2009 23:55:37 -0700

I opened an issue here, if you would like to comment on it:
https://issues.apache.org/jira/browse/HADOOP-5539

Billy

"Stefan Will" <stefan.w...@gmx.net> wrote in message news:c5e7dc6d.1840d%stefan.w...@gmx.net...

I noticed this too. I think the compression only applies to the final mapper
and reducer outputs, but not to any intermediate files produced. The reducer
will decompress the map output files after copying them, and then compress
its own output only after it has finished.

I wonder if this is by design, or just an oversight.

-- Stefan

From: Billy Pearson <sa...@pearsonwholesale.com>
Reply-To: <core-user@hadoop.apache.org>
Date: Wed, 18 Mar 2009 22:14:07 -0500
To: <core-user@hadoop.apache.org>
Subject: Re: intermediate results not getting compressed

I can run head on the map.out files and I get compressed garbage, but when
I run head on an intermediate file I can read the data in the file clearly,
so compression is not getting passed through, even though I am setting
CompressMapOutput to true by default in my hadoop-site.conf file.

Billy
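
The "run head and see whether it is readable" check described above can be approximated in code. What follows is only a minimal sketch, not anything from Hadoop itself: the class name CompressedLookSketch and the 4096-byte sample size are made up for illustration, and the heuristic assumes the uncompressed records are mostly text. Binary records would look unreadable even when uncompressed, so a low percentage is only a hint.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reports how much of the start of a file is printable text.
class CompressedLookSketch {

    // Fraction of the first `length` bytes that are printable ASCII or common whitespace.
    static double printableFraction(byte[] data, int length) {
        if (length == 0) {
            return 0.0;
        }
        int printable = 0;
        for (int i = 0; i < length; i++) {
            int b = data[i] & 0xFF;
            if ((b >= 0x20 && b <= 0x7E) || b == '\n' || b == '\r' || b == '\t') {
                printable++;
            }
        }
        return (double) printable / length;
    }

    // Reads up to maxBytes from the start of the file and scores them.
    static double printableFraction(Path file, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        try (InputStream in = Files.newInputStream(file)) {
            int total = 0;
            int n;
            while (total < maxBytes && (n = in.read(buf, total, maxBytes - total)) > 0) {
                total += n;
            }
            return printableFraction(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass file names on the command line, e.g. a map.out file and an intermediate.X file.
        for (String name : args) {
            double f = printableFraction(Paths.get(name), 4096);
            System.out.printf("%s: %.0f%% printable%n", name, f * 100);
        }
    }
}
```

Run against a map.out file and an intermediate.X file from the same task, a large gap between the two percentages would match what head shows.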


"Billy Pearson" <sa...@pearsonwholesale.com>
wrote in message news:gpscu3$66...@ger.gmane.org...

The intermediate.X files are not getting compressed for some reason; I am
not sure why.
I downloaded and built the latest branch for 0.19.

o.a.h.mapred.Merger.class line 432
new Writer<K, V>(conf, fs, outputFile, keyClass, valueClass, codec);

This seems to use the codec defined above, but for some reason it is not
working correctly: the compression is not passing from the map output files
to the on-disk merge of the intermediate.X files.

tail task report from one server:

2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask:
Interleaved on-disk merge complete: 1730 files left.
2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask:
In-memory merge complete: 3 files left.

2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping
3 segments, 39835369 bytes in memory for intermediate, on-disk merge
2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging
1730 files, 70359998581 bytes from disk
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging
0 segments, 0 bytes from memory into reduce
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733
sorted segments
2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22
intermediate segments out of a total of 1733
2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1712
2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1683
2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1654
2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1625
2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1596
2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1567
2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1538
2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1509
2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1480
2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1451
2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1422
2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1393
2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1364
2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1335
2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1306

Note that the size of the files is about 70 GB (70359998581 bytes), and
these are compressed. At this point it has gone from 1733 files to 1306
left to merge, the intermediate.X files are well over 200 GB, and we are
not even close to done. If compression were working, we should not see
tasks failing at this point for lack of hard drive space, since as we merge
we delete the merged files from the output folder.

I only see this happening when there are too many files left that did not
get merged during the shuffle stage and it starts on-disk merging.

The tasks that complete the merges and keep the file count below the
io.sort size (in my case 30) skip the on-disk merge and complete using a
normal amount of hard drive space.
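
The segment counts in the log above (22 merged out of 1733, then 30 at a time out of 1712, 1683, ...) are consistent with a merge factor of 30 where the first pass is shortened so that every later pass merges a full 30 segments. The sketch below reproduces that arithmetic. It is an illustration under assumptions, not the actual Merger code: the class and method names are hypothetical, and it assumes each pass replaces the merged segments with a single intermediate file and that no intermediate pass is needed once the count is at or below the factor.

```java
// Hypothetical sketch of the multi-pass merge arithmetic seen in the log.
class MergePassSketch {

    // How many segments one pass merges. The first pass is shortened so that
    // the remaining passes can each merge exactly `factor` segments.
    static int passFactor(int factor, int passNo, int numSegments) {
        if (passNo > 1 || numSegments <= factor || factor == 1) {
            return factor;
        }
        int mod = (numSegments - 1) % (factor - 1);
        return mod == 0 ? factor : mod + 1;
    }

    // Number of intermediate passes until the segment count fits in one final merge.
    static int intermediatePasses(int factor, int numSegments) {
        int passes = 0;
        int remaining = numSegments;
        while (remaining > factor) {
            int merged = passFactor(factor, passes + 1, remaining);
            // The merged segments are replaced by one new intermediate file.
            remaining = remaining - merged + 1;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        int factor = 30;      // merge factor from the message (assumed to be io.sort.factor)
        int remaining = 1733; // sorted segments at the start of the merge, from the log
        for (int pass = 1; pass <= 3; pass++) {
            int merged = passFactor(factor, pass, remaining);
            System.out.println("Merging " + merged
                    + " intermediate segments out of a total of " + remaining);
            remaining = remaining - merged + 1;
        }
        System.out.println("Intermediate passes needed: "
                + intermediatePasses(factor, 1733));
    }
}
```

Under these assumptions the 16 "Merging ... intermediate segments" lines in the log are the first 16 intermediate passes, which fits the remark that the task is not even close to done when it runs out of space.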

Anyone care to take a look?

This job takes two or more days to get to this point, so it is getting to
be kind of a pain in the butt to run and watch the reduces fail and the job
keep failing no matter what.

I can post the tail of this task log when it fails to show you how far it
gets before it runs out of space. Before the reduce on-disk merge starts,
the disks are about 35-40% used on 500 GB drives, with two tasks running at
the same time.

Billy Pearson
