Hi Till,

just to clarify, and to check my understanding:
Let's assume we have the following Bzip2 file:

--------------------------------------------
|A.BA.B|A...B|A....|..BA.|...BA|....B|A...B|
--------------------------------------------
|1     |2    |3    |4    |5    |6    |7    |  ("block number")

The FileInputFormat will need to split the bzip on the specific blocks (marked with pipes here). If I understood you correctly, every subtask (the InputFormat which gets a FileSplit passed) will/should only see a part of the whole bzip. That is fine for blocks where the records (an "A..B"-block) are within the block's bounds.

I think I don't fully understand what happens when a record is split between two or more blocks. Can a subtask which, for example, handles block 3 read into the fourth block to complete the record?

Cheers,
Sebastian