Sorry, I forget how much isn't clear to people who are just starting.

FileInputFormat creates FileSplits. The serialization is very stable and can't be changed without breaking things. The reason pipes can't stringify it is that the string form of an input split is ambiguous (and since input splits are user code, we really can't make assumptions about them). The serialized format of a FileSplit is:

<16 bit filename byte length>
<filename in bytes>
<64 bit offset>
<64 bit length>

Technically the filename uses a funky UTF-8 encoding, but in practice, as long as the filename contains only ASCII characters, the bytes are plain ASCII. Look at org.apache.hadoop.io.UTF8.writeString for the precise definition.
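
To make the layout above concrete, here is a rough sketch of decoding those bytes by hand. It is written in Python only for brevity; the same steps apply in a C++ pipes program. The function names parse_file_split and build_file_split are made up for this example. It assumes the filename is pure ASCII and that all integers are big-endian, which is what Java's DataOutput writes.

```python
import struct


def parse_file_split(data: bytes):
    """Decode a serialized FileSplit into (filename, offset, length).

    Assumes an ASCII-only filename and big-endian integers.
    """
    # 16-bit unsigned byte length of the filename
    (name_len,) = struct.unpack_from(">H", data, 0)
    name_bytes = data[2:2 + name_len]
    if len(name_bytes) != name_len:
        raise ValueError("truncated filename")
    # 64-bit offset followed by 64-bit length (Java longs are signed)
    offset, length = struct.unpack_from(">qq", data, 2 + name_len)
    return name_bytes.decode("ascii"), offset, length


def build_file_split(filename: str, offset: int, length: int) -> bytes:
    """Build bytes in the same layout; only here to exercise the parser."""
    name = filename.encode("ascii")
    return struct.pack(">H", len(name)) + name + struct.pack(">qq", offset, length)


if __name__ == "__main__":
    raw = build_file_split("/user/test/part-00000", 67108864, 1024)
    print(parse_file_split(raw))
```

Filenames with non-ASCII characters would need a decoder that follows the encoding in the Hadoop class mentioned above, which this sketch does not attempt.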

-- Owen
