The file called Word.xps at:

http://www.wssdemo.com/XPS/Forms/AllItems.aspx

exhibits the problem.

You are entirely correct: the entry is STORED, so COMPRESS-100 does the trick
for me.

Simon 



On 12/03/2010 15:17, "Stefan Bodewig" <bode...@apache.org> wrote:

> On 2010-03-12, Simon Tyler <sty...@mimecast.net> wrote:
> 
>> If I explain the scenario in more detail then it might become clearer.
> 
>> I am seeing issues with certain zip files and file formats based on zip
>> (such as docx and xps). We are reading these files from a stream, so we are
>> using the ZipArchiveInputStream.
> 
>> What I see is that we loop around getting each entry with getNextZipEntry
>> and we get a null and stop. All looks good. However we have only extracted 1
>> or 2 entries out of a known 20 or 30 entries - the file-based extractor
>> extracts all the files.
> 
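The loop described above can be sketched with the JDK's own java.util.zip
classes (used here so the snippet is self-contained; ZipArchiveInputStream
follows the same getNextZipEntry()-until-null shape). The entry names and data
are made up for illustration; the key point is that a premature null from the
stream is indistinguishable from a clean end of archive:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StreamLoopDemo {
    public static void main(String[] args) throws IOException {
        // Build a two-entry archive in memory so the example is self-contained.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes("UTF-8"));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes("UTF-8"));
            zos.closeEntry();
        }

        // The streaming read loop: getNextEntry() until null, mirroring
        // ZipArchiveInputStream.getNextZipEntry(). If the reader loses sync,
        // the null arrives early and the loop ends with no error raised.
        int count = 0;
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                count++;
            }
        }
        System.out.println(count); // 2
    }
}
```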
> Understood.  My guess is that whatever is creating your archives is
> using the optional header to identify data descriptors.  I'll try to
> create one with InfoZIP, can't promise anything, though.
> 
>> I cannot provide an example of a file as the examples I have are all
>> customer owned.
> 
> That's a pity.
> 
>> However every xps file I have seen suffers the issue:
> 
> I just created one using the "Save as XPS" addin to Word 2007 on a
> "Hello world" document and the stream worked just fine.
> 
>> http://www.microsoft.com/whdc/xps/xpssampdoc.mspx
> 
> I'll take a look later, likely not today.
> 
>> I have investigated the issue and it is caused by entries that use the
>> central directory.
> 
> You mean the data descriptor, right?
> 
>> What happens in the zip stream reader is that the size, csize and crc
>> fields are all zero, there is no central directory available to the
>> reader so it performs no extraction.
> 
> This is not true.  If the archiver works correctly it has set a flag
> that it is going to use a data descriptor after the entry's data.  If
> this flag has been set AND the compression method is DEFLATE, the stream
> can figure out itself where the entry data ends (since DEFLATE marks EOF
> internally).  If the entry data is STORED the stream cannot know where
> the data ends.
> 
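A minimal sketch of the flag mentioned above, using the JDK's java.util.zip
rather than Commons Compress: per the ZIP specification, bit 3 of the general
purpose flag (bytes 6-7 of the local file header, little-endian) is what
announces a trailing data descriptor. The file names and data are invented for
the demo; ZipOutputStream sets the bit for DEFLATED entries it streams, and
clears it for STORED entries, whose size and CRC must be supplied up front:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class FlagDemo {
    // True if bit 3 of the general purpose flag in the first local file
    // header is set, i.e. a data descriptor follows the entry data.
    static boolean usesDataDescriptor(byte[] zip) {
        int flags = (zip[6] & 0xff) | ((zip[7] & 0xff) << 8);
        return (flags & 0x08) != 0;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes("UTF-8");

        // DEFLATED entry: the writer streams it and announces a data
        // descriptor via bit 3 - the case discussed in this thread.
        ByteArrayOutputStream deflated = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(deflated)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write(data);
            zos.closeEntry();
        }

        // STORED entry: size and CRC must be known before writing, so no
        // data descriptor is needed and bit 3 stays clear.
        CRC32 crc = new CRC32();
        crc.update(data);
        ZipEntry stored = new ZipEntry("b.txt");
        stored.setMethod(ZipEntry.STORED);
        stored.setSize(data.length);
        stored.setCrc(crc.getValue());
        ByteArrayOutputStream storedOut = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(storedOut)) {
            zos.putNextEntry(stored);
            zos.write(data);
            zos.closeEntry();
        }

        System.out.println(usesDataDescriptor(deflated.toByteArray()));
        System.out.println(usesDataDescriptor(storedOut.toByteArray()));
    }
}
```

A writer that combines STORED with bit 3 set is exactly the combination a
streaming reader cannot handle, since nothing in the data marks where the
entry ends.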
> I see several problems while looking through the code:
> 
> * it doesn't verify the method is DEFLATE when a data descriptor is used
>   and it will try to read 0 bytes instead of throwing an exception -
>   this may be causing your problem.  COMPRESS-100
> 
> * the stream just skips over the data descriptor and never reads it - it
>   rather sets size and crc fields from what it has found.  This may be
>   OK since we never check the claimed CRC anyway.
> 
> * the stream skips over exactly four words while the archiver may have
>   used a signature of four bytes.  In that case the stream must skip
>   those extra bytes.  COMPRESS-101
> 
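On the third point: per the ZIP APPNOTE the data descriptor is crc32 +
compressed size + uncompressed size (12 bytes), optionally preceded by the
4-byte signature 0x08074b50, so a reader has to probe for the signature to
know whether to consume 12 or 16 bytes. A sketch of that probe (the helper
name is mine, not from Commons Compress):

```java
public class DescriptorDemo {
    // Optional data descriptor signature, "PK\x07\x08" on disk.
    static final long SIG = 0x08074b50L;

    // Length of the data descriptor starting at buf[off]: 16 bytes if the
    // optional signature is present, 12 if the CRC comes first. Note the
    // probe is heuristic - a CRC value could coincidentally equal SIG.
    static int descriptorLength(byte[] buf, int off) {
        long first = (buf[off] & 0xffL)
                   | ((buf[off + 1] & 0xffL) << 8)
                   | ((buf[off + 2] & 0xffL) << 16)
                   | ((buf[off + 3] & 0xffL) << 24);
        return first == SIG ? 16 : 12;
    }

    public static void main(String[] args) {
        // Descriptor written with the signature: 50 4b 07 08 ...
        byte[] withSig = new byte[16];
        withSig[0] = 0x50;
        withSig[1] = 0x4b;
        withSig[2] = 0x07;
        withSig[3] = 0x08;

        // Descriptor without the signature: starts directly with the CRC.
        byte[] withoutSig = new byte[12];
        withoutSig[0] = (byte) 0xde;

        System.out.println(descriptorLength(withSig, 0));    // 16
        System.out.println(descriptorLength(withoutSig, 0)); // 12
    }
}
```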
>> So my two change requests are simply to enable me to validate entries and
>> detect these types of stream so I can do something appropriate.
> 
> If I'm correct and you are bitten by what is now COMPRESS-100 then it
> should suffice if canReadEntryData returned false.  Right?
> 
>> The second request is to not return null when this type of error occurs but
>> to indicate the error somehow. There might be issues here (I am no zip
>> expert) but I would be worried about false errors being reported.
> 
> That could be COMPRESS-100 as well.  Or COMPRESS-101 is the problem for
> you, in which case we should be able to fix it.  Or it is yet another
> issue that we can't really identify without a testcase.
> 
> Stefan
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 




