Hi Stuart,

There isn't an InputFormat that comes with Hadoop to do this. Rather than pre-processing the file, it would be better to implement your own InputFormat: subclass FileInputFormat and provide an implementation of getRecordReader() that returns your own RecordReader for reading fixed-width records.
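To give that a concrete shape, here's a rough sketch against the old org.apache.hadoop.mapred API. The class names FixedWidthInputFormat and FixedWidthRecordReader are just made up for illustration, and the 100-byte record length is hard-coded:

  import java.io.IOException;

  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  public class FixedWidthInputFormat
      extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      // FixedWidthRecordReader is sketched at the end of this message.
      return new FixedWidthRecordReader((FileSplit) split, job);
    }
  }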
In the next() method you would do something like:

  byte[] buf = new byte[100];
  IOUtils.readFully(in, buf, 0, 100); // third arg is the offset into buf
  pos += 100;

(Note that the third argument to IOUtils.readFully() is the offset into the buffer, not the file position, so it should be 0 here, with pos tracking the position in the file separately.) You would also need to check for the end of the stream; see LineRecordReader for some ideas. You'll also have to handle finding the start of records for a split, which you can do by looking at the split's start offset and seeking forward to the next multiple of 100.

If the RecordReader is a RecordReader<NullWritable, BytesWritable> (i.e. no keys), it will return each record to the mapper as a byte array, and the mapper can then break it into fields. Alternatively, you could split the record into fields in the RecordReader itself, and use your own type which encapsulates the fields as the value.

A fuller sketch of such a RecordReader follows below your quoted message.

Hope this helps.

Cheers,
Tom

On Thu, May 28, 2009 at 1:15 PM, Stuart White <stuart.whi...@gmail.com> wrote:
> I need to process a dataset that contains text records of fixed length
> in bytes. For example, each record may be 100 bytes in length, with
> the first field being the first 10 bytes, the second field being the
> second 10 bytes, etc. There are no newlines in the file. Field
> values have been either whitespace-padded or truncated to fit within
> the specific locations in these fixed-width records.
>
> Does Hadoop have an InputFormat to support processing of such files?
> I looked but couldn't find one.
>
> Of course, I could pre-process the file (outside of Hadoop) to put
> newlines at the end of each record, but I'd prefer not to require such
> a prep step.
>
> Thanks.
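P.S. Here's the RecordReader sketch I mentioned, with the same caveats: old API, made-up class names, a hard-coded 100-byte record length, and it assumes the file length is an exact multiple of the record length (otherwise readFully() will throw EOFException on a truncated final record):

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;

  public class FixedWidthRecordReader
      implements RecordReader<NullWritable, BytesWritable> {

    private static final int RECORD_LENGTH = 100; // hard-coded for illustration

    private FSDataInputStream in;
    private long start; // first byte of the first whole record in this split
    private long end;   // first byte past the records owned by this split
    private long pos;   // current position in the file

    public FixedWidthRecordReader(FileSplit split, JobConf job)
        throws IOException {
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      in = fs.open(file);

      // Align to the start of the next record, i.e. the next multiple of
      // RECORD_LENGTH at or after the split's start offset.
      start = split.getStart();
      long remainder = start % RECORD_LENGTH;
      if (remainder != 0) {
        start += RECORD_LENGTH - remainder;
      }
      end = split.getStart() + split.getLength();
      pos = start;
      in.seek(start);
    }

    public boolean next(NullWritable key, BytesWritable value)
        throws IOException {
      // A record belongs to this split if it *starts* before the split's
      // end; reading it may carry us past the end, which is fine.
      if (pos >= end) {
        return false;
      }
      byte[] buf = new byte[RECORD_LENGTH];
      IOUtils.readFully(in, buf, 0, RECORD_LENGTH);
      value.set(buf, 0, RECORD_LENGTH);
      pos += RECORD_LENGTH;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }

    public BytesWritable createValue() { return new BytesWritable(); }

    public long getPos() { return pos; }

    public float getProgress() {
      if (end == start) {
        return 0.0f;
      }
      return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public void close() throws IOException {
      if (in != null) {
        in.close();
      }
    }
  }

The mapper would then slice each record into fields (bytes 0-9 for the first field, 10-19 for the second, and so on). One thing to watch: BytesWritable.getBytes() can return a buffer longer than the record, so use getLength() to bound the slicing.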