Hi Chris,
Thanks for your reply.
In other words, SequenceFile is only probabilistically binary safe. I noticed
a JIRA issue that attempts to support "append" for existing SequenceFiles
(https://issues.apache.org/jira/browse/HADOOP-7139). It occurred to me that
if an attacker reads the sync marker from an existing file and appends
crafted data containing that sync marker, the file may appear corrupted when
splits are calculated, even though nothing goes wrong when the SequenceFile
is read sequentially. For now, however, this may not be a problem.
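
To make the scenario concrete, here is a minimal sketch of my own. It is not
the Hadoop code: the class SyncScanSketch, the helpers demoMarker, demoFile
and findMarker, and the toy file layout are all hypothetical, and the 4-byte
escape that sync() skips is left out. It only shows that a plain byte scan
cannot tell a real sync marker from a copy of it embedded in record data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class SyncScanSketch {
    static final int SYNC_LEN = 16; // same length as the marker discussed in this thread

    /** A fixed stand-in marker; the real one is generated when the file is created. */
    public static byte[] demoMarker() {
        byte[] m = new byte[SYNC_LEN];
        for (int i = 0; i < SYNC_LEN; i++) {
            m[i] = (byte) (0xA0 + i);
        }
        return m;
    }

    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : parts) {
            out.write(p, 0, p.length);
        }
        return out.toByteArray();
    }

    /**
     * Toy layout (an assumption, not the SequenceFile format):
     * honest record, real marker, crafted record that embeds the marker, real marker.
     */
    public static byte[] demoFile(byte[] marker) {
        byte[] honest = "honest record".getBytes(StandardCharsets.US_ASCII); // 13 bytes
        byte[] head = "evil".getBytes(StandardCharsets.US_ASCII);            // 4 bytes
        byte[] tail = "tail".getBytes(StandardCharsets.US_ASCII);            // 4 bytes
        return concat(honest, marker, head, marker, tail, marker);
    }

    /** Offset of the first occurrence of marker at or after 'from', or -1 if none. */
    public static int findMarker(byte[] data, byte[] marker, int from) {
        for (int i = Math.max(from, 0); i + marker.length <= data.length; i++) {
            int j = 0;
            while (j < marker.length && data[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] marker = demoMarker();
        byte[] file = demoFile(marker);
        // A scan from the start finds the real marker after the honest record.
        System.out.println("first match:  " + findMarker(file, marker, 0));
        // A scan that starts just past it stops inside the crafted record,
        // at the embedded copy, not at the next real marker.
        System.out.println("second match: " + findMarker(file, marker, 14));
    }
}
```

A split boundary that falls between the first marker and the crafted record
would therefore resume in the middle of that record, while a sequential
reader, which follows record lengths, would never look there.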
Thank you again!



2013/4/30 Chris Douglas <cdoug...@apache.org>

> You're not missing anything, but the probability of a 16 (thought it
> was 20?) byte collision with random bytes is vanishingly small. -C
>
> On Sat, Apr 27, 2013 at 4:30 AM, Hs <aswhol...@gmail.com> wrote:
> > Hi,
> >
> > I am learning Hadoop. I read SequenceFile.java in the hadoop-1.0.4 source
> > code and found the sync(long position) method, which is used to find a
> > "sync marker" (a 16-byte MD5 hash generated at file creation time) in a
> > SequenceFile when the file is divided into splits in MapReduce.
> >
> > /** Seek to the next sync mark past a given position. */
> > public synchronized void sync(long position) throws IOException {
> >   if (position+SYNC_SIZE >= end) {
> >     seek(end);
> >     return;
> >   }
> >
> >   try {
> >     seek(position+4);                         // skip escape
> >     in.readFully(syncCheck);
> >     int syncLen = sync.length;
> >     for (int i = 0; in.getPos() < end; i++) {
> >       int j = 0;
> >       for (; j < syncLen; j++) {
> >         if (sync[j] != syncCheck[(i+j)%syncLen])
> >           break;
> >       }
> >       if (j == syncLen) {
> >         in.seek(in.getPos() - SYNC_SIZE);     // position before sync
> >         return;
> >       }
> >       syncCheck[i%syncLen] = in.readByte();
> >     }
> >   } catch (ChecksumException e) {             // checksum failure
> >     handleChecksumException(e);
> >   }
> > }
> >
> > As I understand it, this code simply looks for a byte sequence that is
> > identical to the "sync marker".
> >
> > My question:
> > Suppose the data in a SequenceFile happens to contain a 16-byte sequence
> > identical to the "sync marker". Won't the code above mistakenly treat
> > those 16 bytes as a sync marker, so that the SequenceFile is not parsed
> > correctly?
> >
> > I can't find any "escape" handling for the data or the sync marker. So
> > how can SequenceFile be binary safe? Am I missing something? Please
> > correct me if I am wrong.
> >
> > Thanks!
> >
> > Shawn
>
