You're not missing anything, but the probability of a 16-byte (I thought it was 20?) collision with random bytes is vanishingly small.

-C
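To put a rough number on "vanishingly small", here is a back-of-the-envelope sketch. It assumes the record bytes are uniformly random and independent of the marker, which real data is not guaranteed to be, so treat it as an estimate only. The class and method names are made up for illustration. On the 16 vs. 20 question: the quoted code skips a 4-byte escape before reading the 16-byte hash, which would account for a 20-byte total, but only the 16 hash bytes are counted here.

```java
// Hypothetical helper, not part of Hadoop.
class SyncCollisionOdds {

    /**
     * Expected number of accidental matches of a 16-byte (128-bit) marker
     * when scanning the given number of byte positions, assuming each
     * 16-byte window is uniformly random: positions * 2^-128.
     */
    static double expectedFalseMatches(double positions) {
        return positions * Math.pow(2.0, -128);
    }

    public static void main(String[] args) {
        double onePetabyte = 1e15; // about 10^15 candidate positions
        System.out.println(expectedFalseMatches(onePetabyte));
    }
}
```

Even across a petabyte of data the expected count stays many orders of magnitude below one.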

On Sat, Apr 27, 2013 at 4:30 AM, Hs <aswhol...@gmail.com> wrote:
> Hi,
>
> I am learning hadoop. I read the SequenceFile.java in hadoop-1.0.4 source
> codes. And I find the sync(long position) method which is used to find a
> "sync marker" (a 16 bytes MD5 when generated at file creation time) in
> SequenceFile when splitting SequenceFile into splits in MapReduce.
>
>   /** Seek to the next sync mark past a given position.*/
>   public synchronized void sync(long position) throws IOException {
>     if (position+SYNC_SIZE >= end) {
>       seek(end);
>       return;
>     }
>
>     try {
>       seek(position+4);                         // skip escape
>       in.readFully(syncCheck);
>       int syncLen = sync.length;
>       for (int i = 0; in.getPos() < end; i++) {
>         int j = 0;
>         for (; j < syncLen; j++) {
>           if (sync[j] != syncCheck[(i+j)%syncLen])
>             break;
>         }
>         if (j == syncLen) {
>           in.seek(in.getPos() - SYNC_SIZE);     // position before sync
>           return;
>         }
>         syncCheck[i%syncLen] = in.readByte();
>       }
>     } catch (ChecksumException e) {             // checksum failure
>       handleChecksumException(e);
>     }
>   }
>
> According to my understanding, these codes simply look for a data sequence
> which contain the same data as "sync marker".
>
> My doubt:
> Consider a situation where the data in a SequenceFile happen to contain a
> 16 bytes data sequence the same as "sync marker", the codes above will
> mistakenly treat that 16-bytes data as a "sync marker" and then the
> SequenceFile won't be correctly parsed?
>
> I don't find any "escape" operation about the data or the sync marker. So,
> how can SequenceFile be binary safe? Am I missing something? Please correct
> me if I am wrong.
>
> Thanks!
>
> Shawn
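
For anyone who wants to experiment with the scan in the quoted sync(long position) method, here is a minimal sketch of the same circular-buffer idea over a plain byte array. It is not Hadoop's code: the class name, method name and return convention are made up for illustration, and it leaves out the 4-byte escape and the checksum handling. It does show the point of the question, namely that the scan matches raw bytes and cannot tell a real marker from payload that happens to contain the same bytes.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch, not part of Hadoop.
class SyncScanSketch {

    /**
     * Returns the offset of the first occurrence of marker in data at or
     * after 'from', or -1 if there is none. The window array is used as a
     * ring buffer, like syncCheck in the quoted method.
     */
    static int findMarker(byte[] data, byte[] marker, int from) {
        int len = marker.length;
        if (from < 0 || data.length - from < len) {
            return -1; // not enough bytes left to hold a marker
        }
        // Fill the window with the first len bytes (like in.readFully).
        byte[] window = Arrays.copyOfRange(data, from, from + len);
        int pos = from + len; // index of the next unread byte

        for (int i = 0; ; i++) {
            // Compare the marker against the window, starting at the
            // logical head of the ring buffer (index i % len).
            int j = 0;
            for (; j < len; j++) {
                if (marker[j] != window[(i + j) % len]) {
                    break;
                }
            }
            if (j == len) {
                return pos - len; // the window covers [pos - len, pos)
            }
            if (pos >= data.length) {
                return -1; // input exhausted without a match
            }
            // Overwrite the oldest byte with the next input byte.
            window[i % len] = data[pos++];
        }
    }

    public static void main(String[] args) {
        byte[] marker = {1, 2, 3, 4};
        byte[] data = {9, 9, 1, 2, 9, 1, 2, 3, 4, 9};
        System.out.println(findMarker(data, marker, 0));
    }
}
```

The ring buffer means each input byte is read once and the window is never shifted in memory; only the logical start index moves.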