Thanks for your reply. It clarifies a lot. The part I was not sure about was how to read the last record in a split, but it now seems that is not a problem, since the filesystem handles it for me. :-)
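
For the fixed-length case discussed in this thread, the behaviour described in the quoted reply can be sketched roughly as follows. This is only an illustration in plain Python, not Hadoop code: `read_split`, `RECORD_LEN`, `KEY_LEN`, and the in-memory test data are hypothetical names made up for the example. It assumes 100-byte records with a 10-byte key followed by a 90-byte value, and it assumes a split owns every record that starts inside it.

```python
import io

RECORD_LEN = 100  # assumed fixed record size in bytes
KEY_LEN = 10      # assumed: first 10 bytes are the key, the other 90 the value


def read_split(stream, split_start, split_length,
               record_len=RECORD_LEN, key_len=KEY_LEN):
    """Yield (key, value) for every record that starts inside the split."""
    split_end = split_start + split_length
    # Skip the tail of a record that began in the previous split by
    # rounding split_start up to the next record boundary.
    pos = -(-split_start // record_len) * record_len
    stream.seek(pos)
    while pos < split_end:
        # This read may run past split_end. The stream knows nothing about
        # splits, so a record that straddles the boundary comes back whole.
        record = stream.read(record_len)
        if len(record) < record_len:
            break  # truncated bytes at end of file; ignored in this sketch
        yield record[:key_len], record[key_len:]
        pos += record_len


# Five 100-byte records; the split boundary at byte 250 cuts record 2 in half.
data = b"".join(b"%010d" % i + bytes([65 + i]) * 90 for i in range(5))
stream = io.BytesIO(data)
for start, length in [(0, 250), (250, 250)]:
    keys = [int(key) for key, _ in read_split(stream, start, length)]
    print(start, keys)
```

With this ownership rule the first split returns records 0 to 2 (reading 50 bytes past its end) and the second split skips its first 50 bytes and returns records 3 and 4, so every record is read exactly once.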

On Tue, Jun 2, 2009 at 12:40 PM, Chuck Lam <chuck....@gmail.com> wrote:

> Yes, it's totally possible for part of one record to be in the first file
> split and the rest in the second file split. It's the job of the
> RecordReader to make sure it's always reading in an entire record. Given a
> file split, your RecordReader has to be able to skip over the first few
> bytes to get to the first full record (if there's a partial record at the
> beginning). When it reaches the end of the split, if there's a partial
> record there, it will go get the rest of the record from the next split.
>
> Tom's email earlier in this thread explained some of the details. Like he
> said, look at LineRecordReader for inspiration. The logic for figuring out
> the start of the first full record is in LineRecordReader itself. The
> RecordReader can read the last record (that spans two file splits) without
> any special logic because the Hadoop filesystem abstracts away file split
> boundaries when reading.
>
> On Mon, Jun 1, 2009 at 8:05 PM, Yabo-Arber Xu <arber.resea...@gmail.com> wrote:
>
> > I have a follow-up question on this thread: How do we make sure that at
> > the getFileSplit phase, there are no records that cross the boundary
> > between different file splits?
> >
> > To explain my point better: if each of my records is 100 bytes, could
> > there be a case where some record's key is put in the first file split
> > while its value is put in the second split?
> >
> > Best,
> > Arber
> >
> > On Thu, May 28, 2009 at 10:50 PM, Owen O'Malley <omal...@apache.org> wrote:
> >
> > > On May 28, 2009, at 5:15 AM, Stuart White wrote:
> > >
> > >> I need to process a dataset that contains text records of fixed length
> > >> in bytes. For example, each record may be 100 bytes in length
> > >
> > > The update to the terasort example has an InputFormat that does exactly
> > > that. The key is 10 bytes and the value is the next 90 bytes. It is
> > > pretty easy to write, but I should upload it soon. The output types are
> > > Text, but they just have the binary data in them.
> > >
> > > -- Owen