Re: Is Hadoop SequenceFile binary safe?

Colin McCabe Thu, 02 May 2013 11:29:35 -0700

It seems like we could just set up an escape sequence and make it actually
binary-safe, rather than just probabilistically.  The escape sequence would
only be inserted when there would otherwise be confusion between data and a
sync marker.
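
A minimal sketch of one way such an escape could work, assuming a byte-stuffing scheme: whenever the bytes written so far end with the first n-1 bytes of an n-byte sync marker, the writer inserts one stuffing byte, so the full marker can never appear inside the data. This is not what SequenceFile does today. The class name SyncEscapeSketch and all of its methods are hypothetical, and the sketch ignores record headers and the boundary between a real sync marker and the data that follows it.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical byte-stuffing sketch; not part of Hadoop. Marker length must be >= 2. */
final class SyncEscapeSketch {

    /** Picks a stuffing byte that differs from the last two bytes of the marker. */
    static byte stuffByte(byte[] marker) {
        int n = marker.length;
        for (int c = 0; c < 256; c++) {
            byte b = (byte) c;
            if (b != marker[n - 1] && b != marker[n - 2]) {
                return b;
            }
        }
        throw new AssertionError("unreachable: at most two of 256 values are excluded");
    }

    /** True if buf[0..len) ends with the first (marker.length - 1) bytes of the marker. */
    static boolean endsWithPrefix(byte[] buf, int len, byte[] marker) {
        int p = marker.length - 1;
        if (len < p) {
            return false;
        }
        for (int k = 0; k < p; k++) {
            if (buf[len - p + k] != marker[k]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a copy of data with stuffing bytes added so the marker never occurs. */
    static byte[] escape(byte[] data, byte[] marker) {
        byte stuff = stuffByte(marker);
        byte[] out = new byte[data.length * 2];   // worst case: one stuffing byte per data byte
        int len = 0;
        for (byte b : data) {
            out[len++] = b;
            // The next data byte could complete the marker, so break the run here.
            if (endsWithPrefix(out, len, marker)) {
                out[len++] = stuff;
            }
        }
        return Arrays.copyOf(out, len);
    }

    /** Reverses escape(): drops the byte that follows every marker prefix. */
    static byte[] unescape(byte[] encoded, byte[] marker) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length; i++) {
            out.write(encoded[i]);
            if (endsWithPrefix(encoded, i + 1, marker)) {
                i++;                              // skip the stuffing byte
            }
        }
        return out.toByteArray();
    }

    /** Index of the first occurrence of marker in buf, or -1 if it does not occur. */
    static int indexOf(byte[] buf, byte[] marker) {
        for (int i = 0; i + marker.length <= buf.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(buf, i, i + marker.length), marker)) {
                return i;
            }
        }
        return -1;
    }
}
```

The stuffing byte is chosen to differ from the marker's last byte, so it cannot complete the marker, and from its second-to-last byte, so it cannot itself end another prefix match. Note that this stuffs on every 15-byte prefix match rather than only on a full 16-byte match; that costs one extra byte in a slightly larger (still very rare) set of cases, but it lets the reader undo the escaping without any ambiguity.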


best,
Colin


On Thu, May 2, 2013 at 3:26 AM, Hs <aswhol...@gmail.com> wrote:

> Hi Chris,
> Thanks for your reply.
> That is to say, SequenceFile is only probabilistically binary safe. I noticed
> a JIRA issue that attempts to support "append" for an existing SequenceFile
> (https://issues.apache.org/jira/browse/HADOOP-7139). It occurred to me that
> if an attacker reads the sync marker from an existing file and appends
> crafted data containing that sync marker, the file may appear corrupted
> when splits are calculated, even though nothing goes wrong when the
> SequenceFile is read sequentially. However, this may not be a problem at
> present.
> Thank you again!
>
>
>
> 2013/4/30 Chris Douglas <cdoug...@apache.org>
>
> > You're not missing anything, but the probability of a 16 (thought it
> > was 20?) byte collision with random bytes is vanishingly small. -C
> >
> > On Sat, Apr 27, 2013 at 4:30 AM, Hs <aswhol...@gmail.com> wrote:
> > > Hi,
> > >
> > > I am learning Hadoop. I read SequenceFile.java in the hadoop-1.0.4
> > > source code and found the sync(long position) method, which is used to
> > > find a "sync marker" (a 16-byte MD5 hash generated at file creation
> > > time) in a SequenceFile when dividing it into splits in MapReduce.
> > >
> > > /** Seek to the next sync mark past a given position.*/
> > > public synchronized void sync(long position) throws IOException {
> > >   if (position+SYNC_SIZE >= end) {
> > >     seek(end);
> > >     return;
> > >   }
> > >
> > >   try {
> > >     seek(position+4);                         // skip escape
> > >     in.readFully(syncCheck);
> > >     int syncLen = sync.length;
> > >     for (int i = 0; in.getPos() < end; i++) {
> > >       int j = 0;
> > >       for (; j < syncLen; j++) {
> > >         if (sync[j] != syncCheck[(i+j)%syncLen])
> > >           break;
> > >       }
> > >       if (j == syncLen) {
> > >         in.seek(in.getPos() - SYNC_SIZE);     // position before sync
> > >         return;
> > >       }
> > >       syncCheck[i%syncLen] = in.readByte();
> > >     }
> > >   } catch (ChecksumException e) {             // checksum failure
> > >     handleChecksumException(e);
> > >   }
> > > }
> > >
> > > As I understand it, this code simply scans for a byte sequence that is
> > > identical to the "sync marker".
> > >
> > > My doubt:
> > > Consider a situation where the data in a SequenceFile happens to contain
> > > a 16-byte sequence identical to the "sync marker". Will the code above
> > > mistakenly treat those 16 bytes as a "sync marker", so that the
> > > SequenceFile is not parsed correctly?
> > >
> > > I can't find any "escape" operation applied to the data or the sync
> > > marker. So how can SequenceFile be binary safe? Am I missing something?
> > > Please correct me if I am wrong.
> > >
> > > Thanks!
> > >
> > > Shawn
> >
>
