The file called Word.xps at: http://www.wssdemo.com/XPS/Forms/AllItems.aspx
exhibits the problem.

You are entirely correct: the entry is STORED, so COMPRESS-100 does the trick for me.

Simon

On 12/03/2010 15:17, "Stefan Bodewig" <bode...@apache.org> wrote:

> On 2010-03-12, Simon Tyler <sty...@mimecast.net> wrote:
>
>> If I explain the scenario in more detail then it might become clearer.
>
>> I am seeing issues with certain zip files and file formats based on zip
>> (such as docx and xps). We are reading these files from a stream, so we
>> are using the ZipArchiveInputStream.
>
>> What I see is that we loop around getting each entry with getNextZipEntry
>> until we get a null and stop. All looks good. However, we have only
>> extracted 1 or 2 entries out of a known 20 or 30 entries - the file-based
>> extractor extracts all the files.
>
> Understood. My guess is that whatever is creating your archives is
> using the optional header to identify data descriptors. I'll try to
> create one with InfoZIP, can't promise anything, though.
>
>> I cannot provide an example of a file as the examples I have are all
>> customer owned.
>
> That's a pity.
>
>> However, every xps file I have seen suffers the issue:
>
> I just created one using the "Save as XPS" add-in to Word 2007 on a
> "Hello world" document and the stream worked just fine.
>
>> http://www.microsoft.com/whdc/xps/xpssampdoc.mspx
>
> I'll take a look later, likely not today.
>
>> I have investigated the issue and it is caused by entries that use the
>> central directory.
>
> You mean data descriptor, right?
>
>> What happens in the zip stream reader is that the size, csize and crc
>> fields are all zero; there is no central directory available to the
>> reader, so it performs no extraction.
>
> This is not true. If the archiver works correctly, it has set a flag
> that it is going to use a data descriptor after the entry's data. If
> this flag has been set AND the compression method is DEFLATE, the stream
> can figure out itself where the entry data ends (since DEFLATE marks EOF
> internally).
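[Editor's note: the behaviour Stefan describes - a streaming writer leaving crc/csize/size zeroed in the local file header and setting the data descriptor flag (general purpose bit 3) - can be reproduced with the JDK's own java.util.zip, since ZipOutputStream cannot seek back either. This is a minimal illustrative sketch; the class and helper names are mine, not Commons Compress API.]

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DataDescriptorDemo {

    // Writes one DEFLATED entry. Because ZipOutputStream streams its output,
    // it cannot go back to patch the local header, so it sets flag bit 3 and
    // appends a data descriptor after the entry data instead.
    static byte[] buildZip() throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("Hello world".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    // Little-endian read of len bytes starting at off.
    static long readLE(byte[] b, int off, int len) {
        long v = 0;
        for (int i = len - 1; i >= 0; i--) {
            v = (v << 8) | (b[off + i] & 0xffL);
        }
        return v;
    }

    public static void main(String[] args) throws Exception {
        byte[] zip = buildZip();
        // Local file header layout: signature (offset 0), version (4),
        // flags (6), method (8), time (10), date (12), crc (14),
        // csize (18), size (22).
        long flags = readLE(zip, 6, 2);
        System.out.println("data descriptor flag (bit 3) set: "
                + ((flags & 0x08) != 0));
        System.out.println("crc=" + readLE(zip, 14, 4)
                + " csize=" + readLE(zip, 18, 4)
                + " size=" + readLE(zip, 22, 4));
    }
}
```

Running this prints that bit 3 is set and that crc, csize and size are all 0 in the local header - exactly the state a stream reader sees before the entry data, which is why only DEFLATE (with its internal EOF marker) lets the reader find the end of the entry.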
> If the entry data is STORED, the stream cannot know where the data ends.
>
> I see several problems while looking through the code:
>
> * it doesn't verify the method is DEFLATE when a data descriptor is used
>   and it will try to read 0 bytes instead of throwing an exception -
>   this may be causing your problem. COMPRESS-100
>
> * the stream just skips over the data descriptor and never reads it - it
>   rather sets size and crc fields from what it has found. This may be
>   OK since we never check the claimed CRC anyway.
>
> * the stream skips over exactly four words while the archiver may have
>   used a signature of four bytes. In that case the stream must skip
>   those extra bytes. COMPRESS-101
>
>> So my two change requests are simply to enable me to validate entries and
>> detect these types of stream so I can do something appropriate.
>
> If I'm correct and you are bitten by what is now COMPRESS-100, then it
> should suffice if canReadEntryData returned false. Right?
>
>> The second request is to not return a null when this type of error occurs
>> but to indicate the error somehow. There might be issues here (I am no zip
>> expert) but I would be worried about false errors being reported.
>
> That could be COMPRESS-100 as well. Or COMPRESS-101 is the problem for
> you, in which case we should be able to fix it. Or it is yet another
> issue that we can't really identify without a testcase.
>
> Stefan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> ---------------------------------------------------------------------
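[Editor's note: Stefan's third bullet, the COMPRESS-101 point, concerns the data descriptor's optional leading signature 0x08074b50 ("PK\7\8"): the descriptor is three 32-bit words (crc, compressed size, uncompressed size), and a writer may or may not prepend the signature word, so a reader must handle both forms. A minimal sketch of such a reader follows; this is a hypothetical helper for illustration, not the actual Commons Compress code.]

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DataDescriptorReader {
    // Optional data descriptor signature "PK\7\8".
    static final long DD_SIGNATURE = 0x08074b50L;

    // Parses a data descriptor positioned at buf's current offset and
    // returns {crc, compressedSize, uncompressedSize}. Consumes 12 bytes
    // for the bare form, 16 when the optional signature is present.
    static long[] read(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        long first = buf.getInt() & 0xffffffffL;
        // If the first word is the optional signature, the crc follows it;
        // otherwise the first word already is the crc.
        long crc = (first == DD_SIGNATURE) ? buf.getInt() & 0xffffffffL : first;
        long csize = buf.getInt() & 0xffffffffL;
        long size = buf.getInt() & 0xffffffffL;
        return new long[] { crc, csize, size };
    }
}
```

Note the inherent ambiguity: an entry whose CRC happens to equal 0x08074b50 looks like a signature, so a robust reader has to cross-check the parsed values (e.g. against bytes consumed so far) rather than rely on this test alone - which is part of why skipping a fixed number of words, as the stream did, is not enough.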