[compress] ZIP - encoding of file names - again

Stefan Bodewig Fri, 13 Feb 2009 03:48:25 -0800

Let me try to capture the various threads in SANDBOX-176 and from this
list into something we can draw conclusions from.


First some background:
======================

When I implemented the ZIP classes for Ant, I was working from
InfoZIP's documentation of the format, not PKWARE's.  I've now read the
latter as well and learned a few new things.

In general the ZIP format as defined by PKWARE uses CodePage 437 for
filenames and the ZIP comment.  Initially the spec didn't say so
explicitly; this is just the way it was.  And of course CP437 isn't
good enough for arbitrary file names, so people simply started to use
different encodings - like java.util.zip, which uses UTF-8.

Later revisions of the spec introduce a new flag, the EFS flag (bit 11
of the general purpose bit flag), that can be set to indicate that a
filename is encoded using UTF-8.
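To make the flag concrete, here is a minimal sketch of testing and
setting it.  It assumes the EFS flag is bit 11 (mask 0x0800) of the
general purpose bit flag; the class and method names are made up for
illustration and are not part of any existing code base.

```java
// Hypothetical helper, not part of Ant or commons-compress.
final class EfsFlag {
    // Assumption: the language encoding flag (EFS) is bit 11
    // of the general purpose bit flag.
    static final int EFS_MASK = 1 << 11; // 0x0800

    // True if the entry declares its name and comment as UTF-8.
    static boolean usesUtf8(int generalPurposeFlag) {
        return (generalPurposeFlag & EFS_MASK) != 0;
    }

    // Returns the flag word with the EFS bit set or cleared,
    // leaving all other bits untouched.
    static int withUtf8(int generalPurposeFlag, boolean utf8) {
        return utf8
            ? (generalPurposeFlag | EFS_MASK)
            : (generalPurposeFlag & ~EFS_MASK);
    }
}
```

A reader would call usesUtf8 on the flag word of each entry header
before deciding how to decode the name bytes.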

According to Wolfgang's tests in SANDBOX-176 the flag is honored by
WinZip and 7Zip while reading, and I've checked that InfoZIP's unzip
5.x deals with it as well.  7Zip uses it for writing file names,
WinZip doesn't, and InfoZIP's zip 3.x may or may not write it depending
on compilation options.

InfoZIP introduces two new extra fields that can hold UTF-8 encoded
versions of the file names and the archive comments.  Extra fields in
ZIP archives hold additional data that are supposed to be ignored by
archivers that don't understand them.
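As a rough illustration of what such an extra field looks like, the
sketch below builds a Unicode Path field.  It assumes the layout
documented by InfoZIP: header ID 0x7075, a two-byte data size, a
version byte of 1, the CRC-32 of the name as stored in the regular
header, and then the UTF-8 name, all little-endian.  The class name is
hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Ant or commons-compress.
final class UnicodePathField {
    // Assumption: 0x7075 is the header ID of the Unicode Path field.
    static final int HEADER_ID = 0x7075;

    // rawHeaderName: the name bytes exactly as written to the entry header.
    // unicodeName:   the same name, to be stored as UTF-8.
    static byte[] build(byte[] rawHeaderName, String unicodeName) {
        byte[] utf8 = unicodeName.getBytes(StandardCharsets.UTF_8);

        // The CRC lets a reader detect that the header name was changed
        // by a tool that did not update the extra field.
        CRC32 crc = new CRC32();
        crc.update(rawHeaderName);

        int dataSize = 1 + 4 + utf8.length; // version + CRC + name
        if (dataSize > 0xFFFF) {
            throw new IllegalArgumentException("name too long for an extra field");
        }
        ByteBuffer buf = ByteBuffer.allocate(4 + dataSize)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) HEADER_ID);
        buf.putShort((short) dataSize);
        buf.put((byte) 1);                // version
        buf.putInt((int) crc.getValue()); // CRC-32 of the header name
        buf.put(utf8);
        return buf.array();
    }
}
```

An archiver that doesn't know the header ID can skip the field using
the size alone, which is what makes this approach backwards compatible.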

Wolfgang's tests indicate that WinZip reads and writes the extra
fields and 7Zip doesn't support them.  The InfoZIP tools certainly
support them.

Windows' built-in ZIP lib doesn't support either approach.

Supporting new extra fields is simple using the existing compress code
base, and supporting the EFS flag isn't that hard either - SANDBOX-176
contains the necessary code that really only needs to be tweaked to
conform to what we want to do by default.

Reading
=======

Let's keep ZipArchiveInputStream out of the discussion for now 8-)

I propose to change ZipFile to support both the EFS flag as well as
the InfoZIP extra fields when reading archives.

I'm not sure what ZipFile should do if it encounters both the EFS flag
and the extra fields.  Likely it is best to assume both hold the same
information and simply use the EFS encoded name.
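The precedence proposed here could be sketched roughly as follows.
This is not the ZipFile implementation; the class, method and
parameter names are hypothetical, and the extra field data is assumed
to be the InfoZIP Unicode Path payload (version byte, little-endian
CRC-32 of the header name, UTF-8 name) with the header ID and size
already stripped.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Ant or commons-compress.
final class EntryNameResolver {
    // rawName:         name bytes from the entry header
    // efsFlagSet:      whether the EFS flag is set for this entry
    // unicodePathData: payload of the Unicode Path extra field, or null
    // fallback:        encoding to use when neither mechanism applies
    static String resolve(byte[] rawName, boolean efsFlagSet,
                          byte[] unicodePathData, Charset fallback) {
        // 1. The EFS flag wins, even if an extra field is present too.
        if (efsFlagSet) {
            return new String(rawName, StandardCharsets.UTF_8);
        }
        // 2. Otherwise use the extra field, but only if its CRC still
        //    matches the header name (i.e. the field is not stale).
        if (unicodePathData != null && unicodePathData.length >= 5) {
            ByteBuffer buf = ByteBuffer.wrap(unicodePathData)
                                       .order(ByteOrder.LITTLE_ENDIAN);
            int version = buf.get();
            long storedCrc = buf.getInt() & 0xFFFFFFFFL;
            CRC32 crc = new CRC32();
            crc.update(rawName);
            if (version == 1 && storedCrc == crc.getValue()) {
                return new String(unicodePathData, 5,
                                  unicodePathData.length - 5,
                                  StandardCharsets.UTF_8);
            }
        }
        // 3. Neither applies: decode with the configured default.
        return new String(rawName, fallback);
    }
}
```

Checking the EFS flag first means the extra field is never parsed for
entries that already declare UTF-8, which matches the assumption that
both carry the same information.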

The question is what ZipFile should assume as its default if neither
the EFS flag nor the extra fields are present.  This can be controlled
by "setEncoding" right now and defaults to the platform's default
encoding, but a default of UTF-8 (compatible with java.util.zip) or
CodePage 437 (compatible with the formal ZIP spec) would be a valid
choice as well.

Writing
=======

I propose new flags get/setLanguageEncodingFlag for EFS and
get/setAddUnicodeExtraFields on ZipArchiveOutputStream that control
whether either approach is used.  I.e. I propose to optionally support
either approach (and both at the same time).

IMHO the main question is what the code should do by default.

Currently I think the best default approach would be to use UTF-8 as
the default encoding and set the EFS bit since this will create
archives compatible with java.util.zip but has the additional benefit
of clearly stating it is using UTF-8.

Note that using the EFS bit may make the archive unreadable for old
archivers; that's why we need the option to turn it off.

I wouldn't add the InfoZIP extra fields by default since they increase
the archive size.
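To summarize the proposed write-side defaults in code: the sketch
below is a stand-alone options holder, not ZipArchiveOutputStream
itself.  The accessor names follow the proposal above; the class name,
the setEncoding signature and the generalPurposeFlag helper are
assumptions made for illustration, as is the EFS mask of 0x0800
(bit 11).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical options holder, not part of Ant or commons-compress.
final class ZipNameEncodingOptions {
    private static final int EFS_MASK = 0x0800; // assumed: bit 11

    // Proposed defaults: UTF-8 names, EFS bit set, no extra fields.
    private Charset encoding = StandardCharsets.UTF_8;
    private boolean languageEncodingFlag = true;
    private boolean addUnicodeExtraFields = false;

    Charset getEncoding() { return encoding; }
    void setEncoding(Charset encoding) { this.encoding = encoding; }

    boolean getLanguageEncodingFlag() { return languageEncodingFlag; }
    void setLanguageEncodingFlag(boolean b) { languageEncodingFlag = b; }

    boolean getAddUnicodeExtraFields() { return addUnicodeExtraFields; }
    void setAddUnicodeExtraFields(boolean b) { addUnicodeExtraFields = b; }

    // The EFS bit is only truthful if names really are written as
    // UTF-8, so it is cleared for any other encoding.
    int generalPurposeFlag(int base) {
        boolean utf8 = StandardCharsets.UTF_8.equals(encoding);
        return (languageEncodingFlag && utf8)
            ? (base | EFS_MASK)
            : (base & ~EFS_MASK);
    }
}
```

Someone who needs archives readable by old archivers would call
setLanguageEncodingFlag(false) and leave the rest untouched.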

Stefan

