Sorry, I forget how much isn't clear to people who are just starting.
FileInputFormat creates FileSplits. The serialization is very stable
and can't be changed without breaking things. The reason that pipes
can't stringify it is that the string form of an input split is
ambiguous (and since input splits are user code, we really can't make
assumptions about them). The format of FileSplit is:
<16 bit filename byte length>
<filename in bytes>
<64 bit offset>
<64 bit length>
Technically the filename uses a funky utf-8 encoding, but in practice,
as long as the filename contains only ascii characters, the bytes are
plain ascii. Look at org.apache.hadoop.io.UTF8.writeString for the
precise definition.
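
To make the layout above concrete, here is a rough sketch of decoding
those bytes. It is written in Python only to keep it short; the same
steps apply in a C++ pipes program. It assumes the integers are
big-endian (the order Java's DataOutput writes them in) and that the
filename is pure ascii, so the funky encoding never comes into play.
The names parse_file_split and encode_file_split are made up for this
example and are not part of Hadoop.

```python
import struct


def parse_file_split(buf: bytes):
    """Decode (filename, offset, length) from serialized FileSplit bytes.

    Assumptions: big-endian integers and an ASCII-only filename.
    """
    # <16 bit filename byte length>, unsigned
    (name_len,) = struct.unpack_from(">H", buf, 0)
    # <filename in bytes>
    name_bytes = buf[2:2 + name_len]
    if len(name_bytes) != name_len:
        raise ValueError("buffer ends inside the filename")
    filename = name_bytes.decode("ascii")
    # <64 bit offset> then <64 bit length>, both signed like a Java long
    offset, length = struct.unpack_from(">qq", buf, 2 + name_len)
    return filename, offset, length


def encode_file_split(filename: str, offset: int, length: int) -> bytes:
    """Build bytes in the same layout; handy for testing the parser."""
    name_bytes = filename.encode("ascii")
    return (struct.pack(">H", len(name_bytes))
            + name_bytes
            + struct.pack(">qq", offset, length))


if __name__ == "__main__":
    raw = encode_file_split("hdfs://nn/user/data/part-00000", 134217728, 67108864)
    print(parse_file_split(raw))
```

A filename with non-ascii characters would need a decoder that follows
UTF8.writeString exactly, which this sketch does not attempt.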
-- Owen