Hi Stuart,

There isn't an InputFormat that comes with Hadoop to do this. Rather than pre-processing the file, it would be better to implement your own InputFormat: subclass FileInputFormat and provide an implementation of getRecordReader() that returns your own RecordReader for reading fixed-width records.
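To give that a concrete shape, here's a rough sketch against the old org.apache.hadoop.mapred API. The class names FixedWidthInputFormat and FixedWidthRecordReader are just made up for illustration, and the 100-byte record length is hard-coded:

  import java.io.IOException;

  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  public class FixedWidthInputFormat
      extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      // FixedWidthRecordReader is sketched at the end of this message.
      return new FixedWidthRecordReader((FileSplit) split, job);
    }
  }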
In the next() method you would do something like:

  byte[] buf = new byte[100];
  IOUtils.readFully(in, buf, 0, 100); // third arg is the offset into buf
  pos += 100;

(Note that the third argument to IOUtils.readFully() is the offset into the buffer, not the file position, so it should be 0 here, with pos tracking the position in the file separately.) You would also need to check for the end of the stream; see LineRecordReader for some ideas. You'll also have to handle finding the start of records for a split, which you can do by looking at the split's start offset and seeking forward to the next multiple of 100.

If the RecordReader is a RecordReader<NullWritable, BytesWritable> (i.e. no keys), it will return each record to the mapper as a byte array, and the mapper can then break it into fields. Alternatively, you could split the record into fields in the RecordReader itself, and use your own type which encapsulates the fields as the value.

A fuller sketch of such a RecordReader follows below your quoted message.

Hope this helps.

Cheers,
Tom

On Thu, May 28, 2009 at 1:15 PM, Stuart White <stuart.whi...@gmail.com> wrote:
> I need to process a dataset that contains text records of fixed length
> in bytes. For example, each record may be 100 bytes in length, with
> the first field being the first 10 bytes, the second field being the
> second 10 bytes, etc. There are no newlines in the file. Field
> values have been either whitespace-padded or truncated to fit within
> the specific locations in these fixed-width records.
>
> Does Hadoop have an InputFormat to support processing of such files?
> I looked but couldn't find one.
>
> Of course, I could pre-process the file (outside of Hadoop) to put
> newlines at the end of each record, but I'd prefer not to require such
> a prep step.
>
> Thanks.
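P.S. Here's the RecordReader sketch I mentioned, with the same caveats: old API, made-up class names, a hard-coded 100-byte record length, and it assumes the file length is an exact multiple of the record length (otherwise readFully() will throw EOFException on a truncated final record):

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;

  public class FixedWidthRecordReader
      implements RecordReader<NullWritable, BytesWritable> {

    private static final int RECORD_LENGTH = 100; // hard-coded for illustration

    private FSDataInputStream in;
    private long start; // first byte of the first whole record in this split
    private long end;   // first byte past the records owned by this split
    private long pos;   // current position in the file

    public FixedWidthRecordReader(FileSplit split, JobConf job)
        throws IOException {
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      in = fs.open(file);

      // Align to the start of the next record, i.e. the next multiple of
      // RECORD_LENGTH at or after the split's start offset.
      start = split.getStart();
      long remainder = start % RECORD_LENGTH;
      if (remainder != 0) {
        start += RECORD_LENGTH - remainder;
      }
      end = split.getStart() + split.getLength();
      pos = start;
      in.seek(start);
    }

    public boolean next(NullWritable key, BytesWritable value)
        throws IOException {
      // A record belongs to this split if it *starts* before the split's
      // end; reading it may carry us past the end, which is fine.
      if (pos >= end) {
        return false;
      }
      byte[] buf = new byte[RECORD_LENGTH];
      IOUtils.readFully(in, buf, 0, RECORD_LENGTH);
      value.set(buf, 0, RECORD_LENGTH);
      pos += RECORD_LENGTH;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }

    public BytesWritable createValue() { return new BytesWritable(); }

    public long getPos() { return pos; }

    public float getProgress() {
      if (end == start) {
        return 0.0f;
      }
      return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public void close() throws IOException {
      if (in != null) {
        in.close();
      }
    }
  }

The mapper would then slice each record into fields (bytes 0-9 for the first field, 10-19 for the second, and so on). One thing to watch: BytesWritable.getBytes() can return a buffer longer than the record, so use getLength() to bound the slicing.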