Hi Chris,

Thanks for the report. I filed https://issues.apache.org/jira/browse/HADOOP-9667 for this.
Colin
Software Engineer, Cloudera

On Mon, Jun 24, 2013 at 2:20 AM, Christopher Ng <cng1...@gmail.com> wrote:
> cross-posting this from cdh-users group where it received little interest:
>
> is there a bug in SequenceFile.sync()? This is from cdh4.3.0:
>
>   /** Seek to the next sync mark past a given position.*/
>   public synchronized void sync(long position) throws IOException {
>     if (position+SYNC_SIZE >= end) {
>       seek(end);
>       return;
>     }
>
>     if (position < headerEnd) {
>       // seek directly to first record
>       in.seek(headerEnd);    <==== should this not call seek (ie this.seek) instead?
>       // note the sync marker "seen" in the header
>       syncSeen = true;
>       return;
>     }
>
> the problem is that when you sync to the start of a compressed file, the
> noBufferedKeys and valuesDecompressed isn't reset so a block read isn't
> triggered. When you subsequently call next() you're potentially getting
> keys from the buffer which still contains keys from the previous position
> of the file.
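
For anyone following the thread, the failure mode described above can be sketched with a toy model. This is not Hadoop code: StaleBufferDemo, ToyBlockReader, rawSeek and BLOCK_SIZE are hypothetical names invented for illustration. The sketch assumes only what the report states, namely that moving the underlying stream without clearing the reader's buffered keys lets next() keep serving keys from the old position.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Hadoop code) of a reader that buffers one block of keys at a time. */
public class StaleBufferDemo {
    static final int BLOCK_SIZE = 3; // hypothetical block size for this sketch

    static class ToyBlockReader {
        private final int[] data;                               // stands in for the file contents
        private int pos = 0;                                    // position of the underlying "stream"
        private final List<Integer> buffer = new ArrayList<>(); // keys decoded from the current block

        ToyBlockReader(int[] data) { this.data = data; }

        /** Analogue of in.seek(): moves only the raw stream, buffered keys are left in place. */
        void rawSeek(int position) { pos = position; }

        /** Analogue of this.seek(): moves the stream AND drops buffered keys. */
        void seek(int position) {
            pos = position;
            buffer.clear();
        }

        /** Returns the next key, reading a new block only when the buffer is empty. */
        int next() {
            if (buffer.isEmpty()) {
                for (int i = 0; i < BLOCK_SIZE && pos < data.length; i++) {
                    buffer.add(data[pos++]);
                }
            }
            return buffer.remove(0); // throws if there is no data left; acceptable for a sketch
        }
    }

    /** Reads one key, rewinds with rawSeek, and returns what next() yields. */
    static int nextAfterRawSeek() {
        ToyBlockReader r = new ToyBlockReader(new int[] {10, 11, 12, 13, 14, 15});
        r.next();     // loads block [10, 11, 12] and consumes 10
        r.rawSeek(0); // stream is back at 0, but 11 and 12 are still buffered
        return r.next();
    }

    /** Reads one key, rewinds with seek, and returns what next() yields. */
    static int nextAfterSeek() {
        ToyBlockReader r = new ToyBlockReader(new int[] {10, 11, 12, 13, 14, 15});
        r.next();  // loads block [10, 11, 12] and consumes 10
        r.seek(0); // buffer is cleared, so the next call re-reads the first block
        return r.next();
    }

    public static void main(String[] args) {
        System.out.println("after rawSeek(0): " + nextAfterRawSeek()); // stale key: 11
        System.out.println("after seek(0):    " + nextAfterSeek());    // first key: 10
    }
}
```

In this model, rewinding with rawSeek(0) yields 11 (a leftover from the old block) where a caller would expect 10, which mirrors the symptom in the report. Whether the real fix is to call this.seek() or to reset the buffered-key state some other way is left to the JIRA above.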