Problem found while using LZO compression in Hadoop 0.20.1

李钰 Wed, 09 Jun 2010 03:00:02 -0700

Hi,

While using LZO compression to try to improve the performance of my cluster,
I found that compression did not work. The job I ran is
"org.apache.hadoop.examples.Sort", with the input data generated by
"org.apache.hadoop.examples.RandomWriter".
I have made sure that the LZO native library and jar files are configured
correctly and that all compression-related parameters are set (such as
"mapred.compress.map.output", "mapred.output.compression.type",
"mapred.output.compression.codec", "mapred.output.compress" and
"map.output.compression.codec"). According to the information in the job
logs, the tasktracker did compress the map and job output. But the output
file is not compressed at all!
Then I searched the internet and found from
http://wiki.apache.org/hadoop/SequenceFile that the *SequenceFile Common
Header* contains two bytes that indicate whether compression and block
compression are turned on for the file. I checked the sequence file
generated by RandomWriter, and the result is as follows:


[hdpad...@shihc008 rand-10mb]$ od -c part-00000 | head -n 15
0000000   S   E   Q 006   "   o   r   g   .   a   p   a   c   h   e   .
0000020   h   a   d   o   o   p   .   i   o   .   B   y   t   e   s   W
0000040   r   i   t   a   b   l   e   "   o   r   g   .   a   p   a   c
0000060   h   e   .   h   a   d   o   o   p   .   i   o   .   B   y   t
0000100   e   s   W   r   i   t   a   b   l   e  *\0  \0*  \0  \0  \0  \0
0000120 244   n   ! 177   L 316 030   q   g 035 351   L   ; 024 216 031
0000140  \0  \0  \t 234  \0  \0 001 305  \0  \0 001 301 207   v   5 255
0000160 220   ] 236   <  \b 367   &   9 241  \b   v 303   m 314 203 220
0000200 335  \0 241 325 232 035 037 267 303 360  \n 025   u   P 003 220
0000220   ^ 235 247 036   S 265 271 035   S 247   O   5 337   + 020   q
0000240 277   - 003 212   . 230 221   G 241   5   K   K 031 273 036 206
0000260   ( 317 303 367 351 214 364 262 340   S 211 230  \r 362   % 335
0000300   }   H   w   & 234   S   F 324 321 274   F 377   [ 344   [   h
0000320 204 001 265   ] 037   _   r   , 020 370 246 327 231 017 205 252
0000340 273 016 310   w 361 326 032 332 200   Y  \a   X 342  \r 016 364

I found that the two marked bytes (between the asterisks at offset 0000100)
are set to zero, which means compression is turned off. Since both bytes are
'\0', I guess this may be a defect: we neglected to set these two bytes, so
the sequence file generated by RandomWriter cannot be compressed. I do not
know whether this also happens elsewhere.
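
For anyone who wants to repeat this check without reading the od output by
eye, here is a minimal sketch in Python. It is not Hadoop code. It assumes
the version 6 header layout seen in the dump above ("SEQ", a version byte,
the key and value class names each prefixed by a one-byte length, then the
two flag bytes), and it only handles class names shorter than 128 bytes.
The function names and the sample bytes are made up for illustration.

```python
import io

def read_class_name(stream):
    # Assumption: the name is shorter than 128 bytes, so its length
    # prefix fits in a single byte (as with '"' == 34 in the dump).
    length = stream.read(1)[0]
    if length > 127:
        raise ValueError("multi-byte length prefix not handled in this sketch")
    return stream.read(length).decode("utf-8")

def read_compression_flags(stream):
    # Magic bytes "SEQ" followed by a one-byte format version.
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = stream.read(1)[0]
    if version != 6:
        raise ValueError("this sketch only assumes the version 6 layout")
    key_class = read_class_name(stream)
    value_class = read_class_name(stream)
    # The two bytes marked in the dump: compression, then block compression.
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    return {
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
    }

# Sample header rebuilt by hand from the first 76 bytes of the dump.
name = b"org.apache.hadoop.io.BytesWritable"
sample = b"SEQ\x06" + bytes([len(name)]) + name + bytes([len(name)]) + name + b"\x00\x00"
flags = read_compression_flags(io.BytesIO(sample))
print(flags)
```

To run it against a real file, open the file in binary mode, for example
open("part-00000", "rb"), and pass that to read_compression_flags instead
of the sample bytes.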

Is my understanding right? If not, does anybody know what causes compression
not to work? Looking forward to your reply!

Thanks and Best Regards,
Carp
