Hi -

I’m trying to use the Apache Ant tar package (org.apache.tools.tar, version 
1.8.2) in a Java program that tars large files in Hadoop. The same code tars 
smaller HDFS files all day long without any problem; it fails only on one file 
that is about 17 GB. After three days of reading the source code, I still 
can’t make sense of the error message. The exact file size at the time of the 
error is 17456999265 bytes. The exception I’m seeing is:

12/19/11 5:54 PM [BDM.main] EXCEPTION request to write '65535' bytes exceeds size in header of '277130081' bytes
12/19/11 5:54 PM [BDM.main] EXCEPTION org.apache.tools.tar.TarOutputStream.write(TarOutputStream.java:238)
12/19/11 5:54 PM [BDM.main] EXCEPTION com.yahoo.ads.ngdstone.tpbdm.HDFSTar.archive(HDFSTar.java:149)
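
One observation on the numbers above, offered as a sketch rather than a 
diagnosis: the classic (ustar) tar header stores the entry size as 11 octal 
digits, so it can hold at most 2^33 - 1 bytes (just under 8 GiB). The size 
reported in the exception, 277130081, is exactly the real file size 
17456999265 modulo 2^33, which suggests (but does not prove) that the size was 
truncated when the header was written. The class and method names below are 
hypothetical and exist only to illustrate the arithmetic; nothing here calls 
the tar library itself.

```java
// Illustrative only: TarSizeCheck and its methods are hypothetical helpers,
// not part of org.apache.tools.tar.
class TarSizeCheck {

    // A classic ustar size field holds 11 octal digits = 33 bits.
    static final long CLASSIC_SIZE_LIMIT = 1L << 33; // 8589934592

    // The value left over if a size is cut down to 33 bits.
    static long truncatedSize(long actualSize) {
        return actualSize % CLASSIC_SIZE_LIMIT;
    }

    // True if the size can be stored in a classic header without loss.
    static boolean fitsInClassicHeader(long actualSize) {
        return actualSize >= 0 && actualSize < CLASSIC_SIZE_LIMIT;
    }

    public static void main(String[] args) {
        long actual = 17456999265L; // file size from the report above
        System.out.println(truncatedSize(actual));       // 277130081
        System.out.println(fitsInClassicHeader(actual)); // false
    }
}
```

A check like fitsInClassicHeader(fileStatus.getLen()) before putNextEntry 
would at least turn the confusing mid-write failure into an early, explicit 
one for any file of 8 GiB or more.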

My code is:

            TarEntry entry = new TarEntry(p.getName());
            // HDFS Path
            Path absolutePath = p.isAbsolute() ? p : new Path(baseDir, p);
            // HDFS fileStatus
            FileStatus fileStatus = fs.getFileStatus(absolutePath);
            entry.setNames(fileStatus.getOwner(), fileStatus.getGroup());
            entry.setUserName(user);
            entry.setGroupName(group);
            entry.setName(name);
            entry.setSize(fileStatus.getLen());
            entry.setMode(Integer.parseInt("0100" + permissions, 8));
            out.putNextEntry(entry); // out = TarOutputStream

            if (fileStatus.getLen() > 0) {

                InputStream in = fs.open(absolutePath); // large file in HDFS

                try {

                    ++nEntries;

                    int bytesRead = in.read(buf);

                    while (bytesRead >= 0) {
                        out.write(buf, 0, bytesRead);
                        bytesRead = in.read(buf);
                    }

                } finally {
                    in.close();
                }
            }

            out.closeEntry();

Any ideas? Am I missing something in the way I’m setting up the 
TarOutputStream or the TarEntry? Or does tar have implicit size limits that 
will never work for multi-gigabyte files?

Thanks!

Frank
