On 2010-03-12, Simon Tyler <sty...@mimecast.net> wrote:

> If I explain the scenario in more detail then it might become clearer.

> I am seeing issues with certain zip files and file formats based on zip
> (such as docx and xps). We are reading these files from a stream so are
> using the ZipArchiveInputStream.

> What I see is that we loop around getting each entry with getNextZipEntry
> and we get a null and stop. All looks good. However we have only extracted
> 1 or 2 entries out of a known 20 or 30 entries - the file based extractor
> extracts all the files.

Understood.  My guess is that whatever is creating your archives is
using the optional signature to mark its data descriptors.  I'll try to
create one with InfoZIP; can't promise anything, though.

> I cannot provide an example of a file as the examples I have are all
> customer owned.

That's a pity.

> However every xps file I have seen suffers the issue:

I just created one using the "Save as XPS" add-in to Word 2007 on a
"Hello world" document and the stream worked just fine.

> http://www.microsoft.com/whdc/xps/xpssampdoc.mspx

I'll take a look later, likely not today.

> I have investigated the issue and it is caused by entries that use the
> central directory.

You mean the data descriptor, right?

> What happens in the zip stream reader is that the size, csize and crc
> fields are all zero; there is no central directory available to the
> reader, so it performs no extraction.

This is not true.  If the archiver works correctly, it has set a flag
indicating that it is going to write a data descriptor after the entry's
data.  If this flag is set AND the compression method is DEFLATE, the
stream can figure out by itself where the entry data ends (since DEFLATE
marks EOF internally).  If the entry is STORED, the stream has no way to
know where the data ends.
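
To make that concrete, here is a minimal sketch (not the actual Commons
Compress code) of the decision the stream has to make, assuming the
local file header has already been read into a byte array; the offsets
follow the local file header layout in PKWARE's APPNOTE:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class DescriptorCheck {
        private static final int METHOD_DEFLATED = 8;

        /** True if a streaming reader can find the end of the entry's data. */
        static boolean canStreamEntry(byte[] localFileHeader) {
            ByteBuffer buf = ByteBuffer.wrap(localFileHeader)
                                       .order(ByteOrder.LITTLE_ENDIAN);
            int flags  = buf.getShort(6) & 0xFFFF; // general purpose bit flag
            int method = buf.getShort(8) & 0xFFFF; // compression method
            boolean usesDataDescriptor = (flags & 0x08) != 0; // bit 3
            if (!usesDataDescriptor) {
                return true; // sizes in the local file header are authoritative
            }
            // With a data descriptor only DEFLATE marks its own end of data;
            // a STORED entry gives the stream no way to tell where it stops.
            return method == METHOD_DEFLATED;
        }
    }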

I see several problems while looking through the code:

* it doesn't verify the method is DEFLATE when a data descriptor is used
  and it will try to read 0 bytes instead of throwing an exception -
  this may be causing your problem.  COMPRESS-100

* the stream just skips over the data descriptor and never reads it - it
  rather sets size and crc fields from what it has found.  This may be
  OK since we never check the claimed CRC anyway.

* the stream skips over exactly three words (crc, compressed size and
  size) while the archiver may have used a signature of four bytes.  In
  that case the stream must skip those extra bytes - see the sketch
  after this list.  COMPRESS-101
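
For that last point, a hedged sketch of how the data descriptor could be
consumed while tolerating the optional PK\007\010 signature - the class
and helper names here are hypothetical, not the real Commons Compress
internals:

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class DataDescriptorReader {
        // little-endian value of the optional signature PK\007\010
        private static final long DD_SIG = 0x08074b50L;

        /** Reads crc, compressed size and size after the entry's data. */
        static long[] readDataDescriptor(InputStream in) throws IOException {
            DataInputStream din = new DataInputStream(in);
            long first = readWord(din);
            // If the first word is the signature, the crc follows; otherwise
            // the first word already is the crc.  (A crc that happens to
            // equal the signature value is ambiguous - the format offers no
            // way around that.)
            long crc = (first == DD_SIG) ? readWord(din) : first;
            long compressedSize = readWord(din);
            long size = readWord(din);
            return new long[] { crc, compressedSize, size };
        }

        private static long readWord(DataInputStream in) throws IOException {
            long value = 0;
            for (int i = 0; i < 4; i++) {
                value |= (long) in.readUnsignedByte() << (8 * i);
            }
            return value;
        }
    }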

> So my two change requests are simply to enable me to validate entries and
> detect these types of streams so I can do something appropriate.

If I'm correct and you are bitten by what is now COMPRESS-100 then it
should suffice if canReadEntryData returned false.  Right?
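
From the client side that would look roughly like the following,
assuming canReadEntryData learns to report false for entries the stream
cannot extract:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
    import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;

    public class StreamingCheck {
        static void listReadableEntries(InputStream raw) throws IOException {
            ZipArchiveInputStream zin = new ZipArchiveInputStream(raw);
            ZipArchiveEntry entry;
            while ((entry = zin.getNextZipEntry()) != null) {
                if (!zin.canReadEntryData(entry)) {
                    // fall back to file-based ZipFile access or flag the entry
                    System.err.println("cannot stream: " + entry.getName());
                    continue;
                }
                System.out.println("ok: " + entry.getName());
            }
        }
    }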

> The second request is to not return a null when this type of error occurs
> but indicate the error somehow. There might be issues here (I am no zip
> expert) but I would be worried about false errors being reported.

That could be COMPRESS-100 as well.  Or COMPRESS-101 may be the problem
for you, in which case we should be able to fix it.  Or it is yet
another issue that we can't really identify without a test case.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
