Hi Till,

Just to clarify, for my own understanding:

Let's assume we have the following Bzip2 file:

--------------------------------------------
|A.BA.B|A...B|A....|..BA.|...BA|....B|A...B|
--------------------------------------------
|1     |2    |3    |4    |5    |6    |7    | ("block number")

The FileInputFormat will need to split the Bzip2 file at the block
boundaries (marked with pipes here).

If I understood you correctly, every subtask (the InputFormat instance
that is passed a FileSplit) will, or at least should, see only a part of
the whole Bzip2 file.

That is fine for blocks where the records (each an "A..B" sequence) lie
entirely within the block's bounds.

What I don't fully understand is what happens when a record is split
across two or more blocks. Can a subtask that handles block 3, for
example, read into the fourth block to complete the record?
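
To make the question concrete, here is a toy sketch in Python of the
behaviour I would expect. It is only an assumption on my side, not
something I have verified in the FileInputFormat code: it ignores
compression entirely, works on the decompressed characters from the
diagram, and all names (BLOCKS, block_bounds, read_split) are made up for
illustration. The rule it models is: a subtask owns every record that
*starts* inside its block, and it may read past the end of its block to
finish the last record.

```python
# Toy model of the diagram: 'A' starts a record, 'B' ends it.
# Hypothetical names throughout; real Bzip2 blocks are compressed and
# would have to be decompressed before record boundaries are visible.
BLOCKS = ["A.BA.B", "A...B", "A....", "..BA.", "...BA", "....B", "A...B"]
DATA = "".join(BLOCKS)


def block_bounds(blocks):
    """Return the (start, end) offset of each block, end exclusive."""
    bounds, start = [], 0
    for block in blocks:
        bounds.append((start, start + len(block)))
        start += len(block)
    return bounds


def read_split(data, start, end):
    """Return all records that begin in [start, end).

    A record that begins inside the split is read to its closing 'B',
    even if that 'B' lies in a later block. A record that began in an
    earlier block is skipped, because the earlier subtask owns it.
    Raises ValueError if a record is never closed.
    """
    records = []
    pos = start
    while pos < end:
        if data[pos] == "A":
            stop = data.index("B", pos)  # may lie beyond `end`
            records.append(data[pos:stop + 1])
            pos = stop + 1
        else:
            pos += 1  # tail of a record owned by a previous split
    return records


if __name__ == "__main__":
    for number, (start, end) in enumerate(block_bounds(BLOCKS), 1):
        print(number, read_split(DATA, start, end))
```

Under this rule the subtask for block 3 would read into block 4 to
complete its record, the subtask for block 6 would emit nothing, and
every record would be emitted exactly once. Is that roughly how it is
supposed to work?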

Cheers,
Sebastian
