Let me try to condense the various threads from SANDBOX-176 and from this list into something we can draw conclusions from.
First some background:
======================

When I implemented the ZIP classes for Ant, I was working from InfoZIP's documentation of the format, not PKWARE's. I've now read the latter as well and learned a few new things.

In general the ZIP format as defined by PKWARE uses code page 437 for file names and the ZIP comment. Initially the spec didn't say so explicitly, this is just the way it was. And of course CP437 isn't good enough for arbitrary file names, so people simply started to use different encodings - like java.util.zip, which uses UTF-8.

Later revisions of the spec introduce a new flag that can be set to indicate that a file name is encoded using UTF-8, the EFS flag. According to Wolfgang's tests in SANDBOX-176 the flag is honored by WinZip and 7Zip while reading, and I've checked that InfoZIP's unzip 5.x deals with them as well. 7Zip uses it for writing file names, WinZip doesn't, and InfoZIP's zip 3.x may or may not write it depending on compilation options.

InfoZIP introduces two new extra fields that can hold UTF-8 encoded versions of the file names and the archive comments. Extra fields in ZIP archives hold additional data that is supposed to be ignored by archivers that don't understand it. Wolfgang's tests indicate that WinZip reads and writes the extra fields and 7Zip doesn't support them. The InfoZIP tools certainly support them.

Windows' built-in ZIP lib doesn't support either approach.

Supporting new extra fields is simple using the existing compress code base, and supporting the EFS flag isn't that hard either - SANDBOX-176 contains the necessary code, which really only needs to be tweaked to conform to what we want to do by default.

Reading
=======

Let's keep ZipArchiveInputStream out of the discussion for now 8-)

I propose to change ZipFile to support both the EFS flag and the InfoZIP extra fields when reading archives.

I'm not sure what ZipFile should do if it encounters both the EFS flag and the extra fields.
Likely it is best to assume both hold the same information and simply use the EFS encoded name.

The question is what ZipFile should assume as its default if neither the EFS flag nor the extra fields are present. This can be controlled by "setEncoding" right now and defaults to the platform's default encoding, but a default of UTF-8 (compatible with java.util.zip) or code page 437 (compatible with the formal ZIP spec) are valid choices as well.

Writing
=======

I propose new flags get/setLanguageEncodingFlag for EFS and get/setAddUnicodeExtraFields on ZipArchiveOutputStream that control whether either approach is used. I.e. I propose to optionally support either approach (and both at the same time).

IMHO the main question is what the code should do by default. Currently I think the best approach would be to use UTF-8 as the default encoding and set the EFS bit, since this will create archives compatible with java.util.zip but has the additional benefit of clearly stating it is using UTF-8. Note that using the EFS bit may make the archive unreadable for old archivers, that's why we need the option to turn it off.

I wouldn't add the InfoZIP extra fields by default since they increase the archive size.

Stefan
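PS: to make the reading rules from the "Reading" section concrete, here is a minimal sketch of how an entry name could be resolved. It is not the SANDBOX-176 code and not an existing compress API: the class ZipNameSketch and its methods are hypothetical names made up for illustration. It assumes the EFS flag is bit 11 of the general purpose bit flag and that the InfoZIP Unicode Path extra field uses header ID 0x7075 with a one-byte version, a CRC32 of the raw header name and then the UTF-8 name, which is my reading of the PKWARE and InfoZIP documents. The fallback charset is passed in by the caller because the default is still an open question.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration only, not part of commons-compress. */
final class ZipNameSketch {
    /** Assumed: bit 11 of the general purpose bit flag is the EFS (UTF-8) flag. */
    static final int EFS_FLAG = 1 << 11;
    /** Assumed: header ID of the InfoZIP Unicode Path extra field. */
    static final int UNICODE_PATH_ID = 0x7075;

    /** Builds a Unicode Path extra field: ID, size, version 1, CRC32 of raw name, UTF-8 name. */
    static byte[] buildUnicodePathExtra(byte[] rawName, String unicodeName) {
        byte[] utf8 = unicodeName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 5 + utf8.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) UNICODE_PATH_ID);
        buf.putShort((short) (5 + utf8.length));
        buf.put((byte) 1);
        buf.putInt((int) crcOf(rawName));
        buf.put(utf8);
        return buf.array();
    }

    /** EFS wins if set, then a valid Unicode Path extra field, then the fallback charset. */
    static String resolveName(byte[] rawName, int gpFlags, byte[] extra, Charset fallback) {
        if ((gpFlags & EFS_FLAG) != 0) {
            return new String(rawName, StandardCharsets.UTF_8);
        }
        String fromExtra = unicodePathFrom(rawName, extra);
        return fromExtra != null ? fromExtra : new String(rawName, fallback);
    }

    /** Scans the extra data for a Unicode Path field whose CRC matches the raw name. */
    private static String unicodePathFrom(byte[] rawName, byte[] extra) {
        if (extra == null) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int id = buf.getShort() & 0xFFFF;
            int size = buf.getShort() & 0xFFFF;
            if (size > buf.remaining()) {
                return null; // truncated extra data, give up
            }
            if (id == UNICODE_PATH_ID && size >= 5) {
                int version = buf.get();
                long storedCrc = buf.getInt() & 0xFFFFFFFFL;
                byte[] utf8 = new byte[size - 5];
                buf.get(utf8);
                // A CRC mismatch means the header name changed after the field was written.
                if (version == 1 && storedCrc == crcOf(rawName)) {
                    return new String(utf8, StandardCharsets.UTF_8);
                }
            } else {
                buf.position(buf.position() + size); // skip fields we don't understand
            }
        }
        return null;
    }

    private static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }
}
```

A caller would pass the raw name bytes, the general purpose flags and the extra data of an entry, plus whatever default encoding we settle on. Checking the EFS flag first mirrors the suggestion to simply use the EFS encoded name when both are present.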