It seems like we could just define an escape sequence and make the format actually binary-safe, rather than only probabilistically so. The escape sequence would be inserted only where the data could otherwise be confused with a sync marker.
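To make that concrete, here is a minimal sketch of one way such escaping could work, using byte stuffing. It is only an illustration, not how SequenceFile works: the class SyncEscapeSketch and its methods (escape, unescape, stuffByte, contains) are hypothetical names, and the sketch escapes a single payload in isolation, so a real format would also have to handle bytes that straddle a record boundary. The writer inserts one filler byte whenever the last 15 bytes written equal the first 15 bytes of the marker, and the reader drops the byte that follows any such prefix.

```java
import java.util.Arrays;

/** Hypothetical sketch: byte stuffing so a payload never contains the sync marker. */
public class SyncEscapeSketch {

    /** True if buf[0..len) ends with the first sync.length-1 bytes of the marker. */
    static boolean endsWithSyncPrefix(byte[] buf, int len, byte[] sync) {
        int p = sync.length - 1;
        if (len < p) return false;
        for (int i = 0; i < p; i++) {
            if (buf[len - p + i] != sync[i]) return false;
        }
        return true;
    }

    /** Filler byte that differs from the last two marker bytes (marker length >= 2). */
    static byte stuffByte(byte[] sync) {
        byte last = sync[sync.length - 1];
        byte prev = sync[sync.length - 2];
        for (byte c = 0; ; c++) {
            if (c != last && c != prev) return c;
        }
    }

    /** Inserts a filler byte after every occurrence of the marker's prefix. */
    static byte[] escape(byte[] data, byte[] sync) {
        byte stuff = stuffByte(sync);
        byte[] out = new byte[2 * data.length]; // at most one filler per data byte
        int n = 0;
        for (byte b : data) {
            out[n++] = b;
            if (endsWithSyncPrefix(out, n, sync)) {
                out[n++] = stuff; // the marker can no longer be completed here
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Reverses escape(): drops the byte that follows each marker prefix. */
    static byte[] unescape(byte[] enc, byte[] sync) {
        byte[] out = new byte[enc.length];
        int n = 0;
        for (int i = 0; i < enc.length; i++) {
            out[n++] = enc[i];
            if (endsWithSyncPrefix(enc, i + 1, sync)) {
                i++; // skip the filler byte
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Naive search, used only to check the result. */
    static boolean contains(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(haystack, i, i + needle.length), needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[16];
        for (int i = 0; i < sync.length; i++) sync[i] = (byte) (i + 1);
        byte[] data = new byte[18];
        System.arraycopy(sync, 0, data, 1, sync.length); // payload embeds the marker
        byte[] enc = escape(data, sync);
        System.out.println("marker in escaped payload: " + contains(enc, sync));
        System.out.println("round trip ok: " + Arrays.equals(unescape(enc, sync), data));
    }
}
```

The overhead is one extra byte per 15-byte prefix match, so ordinary data is almost never changed; the cost is that readers and writers of the existing format would both have to understand the new rule.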
best,
Colin

On Thu, May 2, 2013 at 3:26 AM, Hs <aswhol...@gmail.com> wrote:

> Hi Chris,
>
> Thanks for your reply.
>
> That is to say, SequenceFile is only probabilistically binary safe. I noticed
> a JIRA issue that attempts to support "append" for existing SequenceFiles
> (https://issues.apache.org/jira/browse/HADOOP-7139). It occurred to me that
> if some hacker reads the sync marker from an existing file and appends
> crafted data containing that sync marker, the file may appear corrupted
> when splits are calculated, even though nothing is wrong when the
> SequenceFile is read sequentially. However, this may not be a problem at
> present.
>
> Thank you again!
>
> 2013/4/30 Chris Douglas <cdoug...@apache.org>
>
> > You're not missing anything, but the probability of a 16 (thought it
> > was 20?) byte collision with random bytes is vanishingly small. -C
> >
> > On Sat, Apr 27, 2013 at 4:30 AM, Hs <aswhol...@gmail.com> wrote:
> > > Hi,
> > >
> > > I am learning Hadoop. I read SequenceFile.java in the hadoop-1.0.4
> > > source code and found the sync(long position) method, which is used
> > > to find a "sync marker" (a 16-byte MD5 hash generated at file creation
> > > time) in a SequenceFile when it is divided into splits in MapReduce.
> > >
> > > /** Seek to the next sync mark past a given position.*/
> > > public synchronized void sync(long position) throws IOException {
> > >   if (position+SYNC_SIZE >= end) {
> > >     seek(end);
> > >     return;
> > >   }
> > >
> > >   try {
> > >     seek(position+4);                         // skip escape
> > >     in.readFully(syncCheck);
> > >     int syncLen = sync.length;
> > >     for (int i = 0; in.getPos() < end; i++) {
> > >       int j = 0;
> > >       for (; j < syncLen; j++) {
> > >         if (sync[j] != syncCheck[(i+j)%syncLen])
> > >           break;
> > >       }
> > >       if (j == syncLen) {
> > >         in.seek(in.getPos() - SYNC_SIZE);     // position before sync
> > >         return;
> > >       }
> > >       syncCheck[i%syncLen] = in.readByte();
> > >     }
> > >   } catch (ChecksumException e) {             // checksum failure
> > >     handleChecksumException(e);
> > >   }
> > > }
> > >
> > > As I understand it, this code simply looks for a byte sequence that
> > > matches the "sync marker".
> > >
> > > My doubt:
> > > Suppose the data in a SequenceFile happens to contain a 16-byte
> > > sequence identical to the "sync marker". Will the code above
> > > mistakenly treat those 16 bytes as a sync marker, so that the
> > > SequenceFile is not parsed correctly?
> > >
> > > I can't find any "escape" operation applied to the data or the sync
> > > marker. So how can SequenceFile be binary safe? Am I missing
> > > something? Please correct me if I am wrong.
> > >
> > > Thanks!
> > >
> > > Shawn
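
The concern in the original question can be reproduced with a small self-contained sketch. It mirrors only the circular-buffer comparison from the quoted sync() method, run over an in-memory byte array instead of a stream. SyncScanSketch and findSync are hypothetical names, and returning the offset where the matching bytes begin is a simplification of the quoted in.seek(...) call.

```java
import java.util.Arrays;

/** Hypothetical sketch of the quoted scan, over a byte array instead of a stream. */
public class SyncScanSketch {

    /**
     * Returns the offset of the first 16-byte window at or after position+4
     * that equals the marker, or -1 if the scan reaches the end first.
     */
    static int findSync(byte[] file, byte[] sync, int position) {
        int end = file.length;
        int syncLen = sync.length;
        int pos = position + 4;                       // skip escape, as in the quoted code
        if (pos + syncLen > end) return -1;
        byte[] syncCheck = Arrays.copyOfRange(file, pos, pos + syncLen);
        pos += syncLen;
        for (int i = 0; pos < end; i++) {
            int j = 0;
            for (; j < syncLen; j++) {
                // syncCheck is a circular buffer whose oldest byte sits at index i % syncLen
                if (sync[j] != syncCheck[(i + j) % syncLen]) break;
            }
            if (j == syncLen) return pos - syncLen;   // start of the matching bytes
            syncCheck[i % syncLen] = file[pos++];     // slide the window by one byte
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[16];
        for (int i = 0; i < sync.length; i++) sync[i] = (byte) (i + 1);

        byte[] file = new byte[80];
        System.arraycopy(sync, 0, file, 10, 16);      // record data that happens to equal the marker
        System.arraycopy(sync, 0, file, 40, 16);      // the marker actually written by the writer

        // The scan cannot tell the two apart: it stops at whichever comes first.
        System.out.println("scan from 0 stops at offset " + findSync(file, sync, 0));
        System.out.println("scan from 20 stops at offset " + findSync(file, sync, 20));
    }
}
```

With a randomly generated 16-byte marker, a given window of random data matches with probability 2^-128, which is why the collision is negligible in practice unless the data is constructed deliberately.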