On 2010-03-12, Simon Tyler <sty...@mimecast.net> wrote:

> If I explain the scenario in more detail then it might become clearer.
> I am seeing issues with certain zip files and file format based on zip (such
> as docx and zip). We are reading these files from a stream so are using the
> ZipArchiveInputStream.

> What I see is that we loop around getting each entry with getNextZipEntry
> and we get a null and stop. All looks good. However we have only extracted 1
> or 2 entries out of a known 20 or 30 entries - the file based extractor
> extracts all the file.

Understood. My guess is that whatever is creating your archives is
using the optional header to identify data descriptors. I'll try to
create one with InfoZIP, can't promise anything, though.

> I cannot provide an example of a file as the examples I have are all
> customer owned.

That's a pity.

> However every xps file I have seen suffers the issue:

I just created one using the "Save as XPS" addin to Word 2007 on a
"Hello world" document and the stream worked just fine.

> http://www.microsoft.com/whdc/xps/xpssampdoc.mspx

I'll take a look later, likely not today.

> I have investigated the issue and it is caused by entries that use the
> central directory.

You mean data descriptor, right?

> What happens in the zip stream reader is that the size, csize and crc
> fields are all zero, there is no central directory available to the
> reader so it performs no extraction.

This is not true. If the archiver works correctly it has set a flag
that it is going to use a data descriptor after the entry's data. If
this flag has been set AND the compression method is DEFLATE, the
stream can figure out itself where the entry data ends (since DEFLATE
marks EOF internally). If the entry data is STORED the stream cannot
know where the data ends.

I see several problems while looking through the code:

* it doesn't verify the method is DEFLATE when a data descriptor is
  used and it will try to read 0 bytes instead of throwing an
  exception - this may be causing your problem.
  COMPRESS-100

* the stream just skips over the data descriptor and never reads it -
  it rather sets size and crc fields from what it has found. This may
  be OK since we never check the claimed CRC anyway.

* the stream skips over exactly four words while the archiver may
  have used a signature of four bytes. In that case the stream must
  skip those extra bytes.  COMPRESS-101

> So my two change requests are simply to enable me to validate entries and
> detect these types of stream so I can do something appropriate.

If I'm correct and you are bitten by what is now COMPRESS-100 then it
should suffice if canReadEntryData returned false. Right?

> The second request is to not return a null when this type of error occurs
> but indicate the error somehow. There might be issues here (I am no zip
> expert) but I would be worried about false errors being reported.

That could be COMPRESS-100 as well. Or COMPRESS-101 is the problem for
you, in which case we should be able to fix it. Or it is yet another
issue that we can't really identify without a testcase.

Stefan
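
For illustration only, the DEFLATE versus STORED distinction and the optional data descriptor signature discussed above could be checked roughly as follows. This is a minimal sketch using only the JDK, not the code in ZipArchiveInputStream; the class and method names are hypothetical. The byte offsets, flag bit 3, the method codes 0 and 8 and the two signatures follow the ZIP format as I understand it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of Commons Compress. */
final class ZipLocalHeaderSketch {
    static final int LFH_SIGNATURE = 0x04034b50;    // "PK\003\004", read little-endian
    static final int DD_SIGNATURE = 0x08074b50;     // optional "PK\007\010"
    static final int FLAG_DATA_DESCRIPTOR = 1 << 3; // general purpose flag bit 3
    static final int METHOD_DEFLATED = 8;           // STORED would be 0

    /**
     * Returns true if a forward-only reader can tell where the entry data ends.
     * Only the STORED and DEFLATED cases from the discussion are considered.
     */
    static boolean streamCanFindEntryEnd(byte[] localHeader) {
        if (localHeader.length < 10) {
            throw new IllegalArgumentException("need at least 10 header bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(localHeader).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LFH_SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        int flags = buf.getShort(6) & 0xFFFF;   // bytes 6-7: general purpose flags
        int method = buf.getShort(8) & 0xFFFF;  // bytes 8-9: compression method
        boolean usesDataDescriptor = (flags & FLAG_DATA_DESCRIPTOR) != 0;
        // No data descriptor: the sizes in the header itself are usable.
        // Data descriptor: only DEFLATE marks its own end of data.
        return !usesDataDescriptor || method == METHOD_DEFLATED;
    }

    /**
     * Length of the data descriptor starting at offset: three 4-byte fields
     * (crc, csize, size), plus 4 bytes if the optional signature comes first.
     * Assumption: a CRC that happens to equal the signature is not handled.
     */
    static int dataDescriptorLength(byte[] data, int offset) {
        int first = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
        return first == DD_SIGNATURE ? 16 : 12;
    }

    public static void main(String[] args) {
        byte[] deflated = {0x50, 0x4b, 0x03, 0x04, 20, 0, 0x08, 0, 8, 0};
        byte[] stored = {0x50, 0x4b, 0x03, 0x04, 20, 0, 0x08, 0, 0, 0};
        System.out.println(streamCanFindEntryEnd(deflated)); // true
        System.out.println(streamCanFindEntryEnd(stored));   // false
    }
}
```

A caller reading from a stream could use the first check to decide whether to skip or reject an entry instead of silently reading 0 bytes, and the second to skip the right number of bytes after the entry data.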