Aki Tanaka created HADOOP-14919:
-----------------------------------

             Summary: BZip2 drops records when reading data in splits
                 Key: HADOOP-14919
                 URL: https://issues.apache.org/jira/browse/HADOOP-14919
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Aki Tanaka


BZip2 can drop records when reading data in splits. This problem was discussed 
previously in HADOOP-11445 and HADOOP-13270, but a corner case remains in which 
whole data blocks are lost.
 
I have attached a unit test that reproduces the problem.
 
First, this issue happens when the position of a newly created stream is equal 
to the start of a split. Hadoop already has test cases for this (for example, 
the blockEndingInCR.txt.bz2 file used by 
TestLineRecordReader#testBzip2SplitStartAtBlockMarker). However, those tests do 
not trigger the issue reported here, because it happens only when the byte 
block at the start of the split includes both the block marker and compressed 
data.
 
BZip2 block marker - 0x314159265359 
(001100010100000101011001001001100101001101011001)
 
blockEndingInCR.txt.bz2 (Start of Split - 136504):
{code:java}
$ xxd -l 6 -g 1 -b -seek 136498 \
  ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
{code}

 
Test bz2 file (Start of Split - 203426):
{code:java}
$ xxd -l 7 -g 1 -b -seek 203419 250000.bz2
0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
0031aa1: 00101111                                               /
{code}
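
The difference between the two dumps is easier to see by searching for the 
block marker at bit granularity. The following is only a minimal sketch, not 
Hadoop code: the class name BZip2MarkerScan and the method findMarkerBitOffset 
are made up for illustration, and the byte arrays are copied by hand from the 
two xxd dumps above.

```java
final class BZip2MarkerScan {
    // 48-bit BZip2 block marker, as quoted in this report.
    static final long BLOCK_MARKER = 0x314159265359L;
    static final long MASK_48 = (1L << 48) - 1;

    /**
     * Hypothetical helper: returns the bit offset (0 = most significant bit
     * of data[0]) at which the block marker first starts, or -1 if absent.
     */
    static int findMarkerBitOffset(byte[] data) {
        long window = 0;
        int totalBits = data.length * 8;
        for (int bit = 0; bit < totalBits; bit++) {
            int b = (data[bit / 8] >> (7 - bit % 8)) & 1;
            // Shift the next bit into a sliding 48-bit window.
            window = ((window << 1) | b) & MASK_48;
            if (bit >= 47 && window == BLOCK_MARKER) {
                return bit - 47;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Six bytes before the split start in blockEndingInCR.txt.bz2.
        byte[] aligned = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
        // Seven bytes before the split start in the test bz2 file.
        byte[] unaligned = {(byte) 0xE6, 0x28, 0x2B, 0x24,
                            (byte) 0xCA, 0x6B, 0x2F};

        System.out.println("aligned marker bit offset:   "
                + findMarkerBitOffset(aligned));
        System.out.println("unaligned marker bit offset: "
                + findMarkerBitOffset(unaligned));
    }
}
```

For the blockEndingInCR.txt.bz2 bytes the marker starts at bit 0 and fills the 
six bytes exactly. For the test file it starts at bit 3 and ends 3 bits into 
the seventh byte, so the last byte before the split start carries the tail of 
the marker followed by 5 bits of compressed data.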

 
Suppose a job divides this test bz2 file into two splits at position 203426.
The first split does not read the records that start at position 203426, 
because BZip2 reports the position of those records as 203427, which is past 
the end of the split. The second split does not read them either, because 
BZip2CompressionInputStream starts reading from the block at position 320955.
As a result, the records between 203427 and 320955 are lost.
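
The gap can be restated as simple position arithmetic. This is a toy model 
only: the class SplitGapSketch and its two rules are assumptions made for 
illustration and do not reproduce the actual LineRecordReader or 
BZip2CompressionInputStream logic. Only the three positions come from this 
report.

```java
final class SplitGapSketch {
    // Toy rule (assumption): the first split reads a block only if the
    // position reported for that block is at or before the split's end.
    static boolean firstSplitTakes(long reportedPos, long splitEnd) {
        return reportedPos <= splitEnd;
    }

    // Toy rule (assumption): the second split can only return blocks located
    // at or after the position where its stream actually starts reading.
    static boolean secondSplitTakes(long reportedPos, long streamStart) {
        return reportedPos >= streamStart;
    }

    public static void main(String[] args) {
        long splitBoundary = 203426L;     // end of split 1, start of split 2
        long reportedPos = 203427L;       // position BZip2 reports for the block
        long secondStreamStart = 320955L; // where the second stream starts reading

        boolean first = firstSplitTakes(reportedPos, splitBoundary);
        boolean second = secondSplitTakes(reportedPos, secondStreamStart);

        System.out.println("first split reads block:  " + first);
        System.out.println("second split reads block: " + second);
        System.out.println("block lost:               " + (!first && !second));
    }
}
```

Under these assumed rules neither split claims the block reported at 203427, 
which matches the data loss described above.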



