Hi -
I'm trying to use the Apache Ant tar package (org.apache.tools.tar, from Ant
1.8.2) in a Java program that tars large files in Hadoop. It currently fails on
a file that is 17 GB long. The same code works fine for smaller files; I tar
smaller HDFS files all day long without a problem. It fails only on that 17 GB
file. After three days of digging through the source code, I'm still having a
hard time making sense of the error message. The exact file size at the time of
the error is 17456999265 bytes. The exception I'm seeing is:
12/19/11 5:54 PM [BDM.main] EXCEPTION request to write '65535' bytes exceeds
size in header of '277130081' bytes
12/19/11 5:54 PM [BDM.main] EXCEPTION
org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:238)
12/19/11 5:54 PM [BDM.main] EXCEPTION
com.yahoo.ads.ngdstone.tpbdm.HDFSTar.archive(HDFSTar.java:149)
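One thing I did notice while digging: the "size in header" is exactly my file
size modulo 2^33, which makes me suspect the octal size field in the ustar
header (11 octal digits, so at most 8589934591 bytes) is silently wrapping. A
quick sanity check:

    long fileSize = 17456999265L;                  // size of the failing file
    long sizeFieldLimit = 077777777777L + 1L;      // 2^33: max of an 11-octal-digit field, plus one
    System.out.println(fileSize % sizeFieldLimit); // prints 277130081 -- the "size in header"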
My code is:
    TarEntry entry = new TarEntry(p.getName());
    Path absolutePath = p.isAbsolute() ? p : new Path(baseDir, p); // HDFS Path
    FileStatus fileStatus = fs.getFileStatus(absolutePath);        // HDFS FileStatus
    entry.setNames(fileStatus.getOwner(), fileStatus.getGroup());
    entry.setUserName(user);
    entry.setGroupName(group);
    entry.setName(name);
    entry.setSize(fileStatus.getLen());
    entry.setMode(Integer.parseInt("0100" + permissions, 8));
    out.putNextEntry(entry); // out = TarOutputStream
    if (fileStatus.getLen() > 0) {
        InputStream in = fs.open(absolutePath); // large file in HDFS
        try {
            ++nEntries;
            int bytesRead = in.read(buf);
            while (bytesRead >= 0) {
                out.write(buf, 0, bytesRead);
                bytesRead = in.read(buf);
            }
        } finally {
            in.close();
        }
    }
    out.closeEntry();
Any ideas? Am I missing something in the way I'm setting up the TarOutputStream
or TarEntry? Or does the tar format have built-in limits that are never going to
work for multi-gigabyte files?
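If it really is the 8 GiB ustar size limit, would switching to Apache Commons
Compress with its POSIX big-number mode be the right way around it? A rough
sketch of what I have in mind (untested; it assumes a Commons Compress version
that has setBigNumberMode, and rawOut is just a placeholder for whatever
OutputStream the archive is written to):

    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

    TarArchiveOutputStream tarOut = new TarArchiveOutputStream(rawOut);
    tarOut.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_POSIX); // PAX headers for sizes > 8 GiB
    tarOut.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX);   // and for long entry names

    TarArchiveEntry entry = new TarArchiveEntry(name);
    entry.setSize(fileStatus.getLen()); // full size, carried in a PAX extended header
    entry.setUserName(user);
    entry.setGroupName(group);
    entry.setMode(Integer.parseInt("0100" + permissions, 8));
    tarOut.putArchiveEntry(entry);
    // ... same HDFS read/write copy loop as above ...
    tarOut.closeArchiveEntry();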
Thanks!
Frank