Hi, I am learning Hadoop and have been reading SequenceFile.java in the hadoop-1.0.4 source code. I found the sync(long position) method, which is used to find a "sync marker" (a 16-byte MD5 hash generated at file creation time) when MapReduce divides a SequenceFile into splits.

    /** Seek to the next sync mark past a given position.*/
    public synchronized void sync(long position) throws IOException {
      if (position+SYNC_SIZE >= end) {
        seek(end);
        return;
      }

      try {
        seek(position+4);                         // skip escape
        in.readFully(syncCheck);
        int syncLen = sync.length;
        for (int i = 0; in.getPos() < end; i++) {
          int j = 0;
          for (; j < syncLen; j++) {
            if (sync[j] != syncCheck[(i+j)%syncLen])
              break;
          }
          if (j == syncLen) {
            in.seek(in.getPos() - SYNC_SIZE);     // position before sync
            return;
          }
          syncCheck[i%syncLen] = in.readByte();
        }
      } catch (ChecksumException e) {             // checksum failure
        handleChecksumException(e);
      }
    }

As far as I understand, this code simply scans for a byte sequence identical to the sync marker.

My question: suppose the data in a SequenceFile happens to contain a 16-byte sequence identical to the sync marker. Won't the code above mistake that data for a sync marker, so that the SequenceFile is no longer parsed correctly? I can't find any escaping applied to either the data or the sync marker.

So, how can SequenceFile be binary safe? Am I missing something? Please correct me if I am wrong.

Thanks!
Shawn
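
P.S. To make the concern concrete, here is a minimal standalone sketch (not Hadoop code) of the same ring-buffer comparison, run over an in-memory byte array instead of a file stream. The names SyncScanSketch, findSync, demoMarker and demoFile are made up for illustration, the marker is a fixed stand-in rather than a real MD5 hash, and the 4-byte escape that precedes a real sync marker is left out. Unlike the original loop, the sketch also checks the final window at the end of the data.

```java
class SyncScanSketch {

    /**
     * Returns the offset of the first occurrence of marker in data
     * at or after from, or -1 if there is none.
     */
    static int findSync(byte[] data, byte[] marker, int from) {
        int n = marker.length;
        if (from + n > data.length) {
            return -1;                       // not enough bytes left for a marker
        }
        byte[] window = new byte[n];         // plays the role of syncCheck
        System.arraycopy(data, from, window, 0, n);
        int pos = from + n;                  // index of the next unread byte
        for (int i = 0; ; i++) {
            int j = 0;
            for (; j < n; j++) {
                // window is a ring buffer whose logical start is i % n
                if (marker[j] != window[(i + j) % n]) {
                    break;
                }
            }
            if (j == n) {
                return pos - n;              // start of the matching 16 bytes
            }
            if (pos >= data.length) {
                return -1;
            }
            window[i % n] = data[pos++];     // overwrite the oldest byte
        }
    }

    /** Hypothetical 16-byte marker: 3, 10, 17, ... (no zero bytes). */
    static byte[] demoMarker() {
        byte[] marker = new byte[16];
        for (int k = 0; k < marker.length; k++) {
            marker[k] = (byte) (k * 7 + 3);
        }
        return marker;
    }

    /** 100 zero bytes with the marker bytes copied in at offsets 10 and 60. */
    static byte[] demoFile(byte[] marker) {
        byte[] data = new byte[100];
        // Pretend offset 10 is record payload that happens to equal the marker.
        System.arraycopy(marker, 0, data, 10, marker.length);
        // Pretend offset 60 is the genuine sync marker.
        System.arraycopy(marker, 0, data, 60, marker.length);
        return data;
    }

    public static void main(String[] args) {
        byte[] marker = demoMarker();
        byte[] data = demoFile(marker);
        System.out.println(findSync(data, marker, 0));   // prints 10
        System.out.println(findSync(data, marker, 11));  // prints 60
    }
}
```

Scanning from offset 0 stops at the payload copy at offset 10 rather than at the intended marker at offset 60, because the comparison has no way to tell the two apart. That is exactly the case I am asking about.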