Hi all,

over the past two weeks commons-compress has gained support for some
more advanced ZIP features and I've merged the changes over to our zip
package. The changes bring two new options with them and I'd like to
get some feedback as to which defaults our tasks should use wrt these
options.

First some background:

Traditionally file names inside ZIP archives are encoded using code
page 437 (the IBM PC/DOS code page). This is insufficient for many
characters and thus people have chosen multiple incompatible ways to
use different encodings. jar uses UTF-8.

Ant's tasks provide options to set the encoding when reading/writing
archives and default to the platform's default encoding for zip/unzip
or UTF-8 for jar/unjar.

Now the new stuff.

Language Encoding Flag
----------------------

PKWARE, as the definer of the ZIP standard, has designated a bit inside
the "general purpose bits" part of the entry's metadata to say "my file
name is in UTF-8".

This flag is recognized by more modern PKWARE archivers, 7ZIP and very
recent InfoZIP tools (if compiled using the correct options). 7ZIP
creates archives using that flag. WinZIP and Windows' "compressed
folders" completely ignore the flag.

The ZipOutputStream code right now sets the flag if the encoding is
UTF-8 (i.e. we are writing JARs), which makes those who understand it
immediately pick up the correct file names. Those who don't know the
flag are no better off than before - java.util.zip seems to be happy
with and without the flag.

The ZipFile code right now recognizes the flag and, if it is set,
ignores any explicitly specified encoding and uses UTF-8 instead,
assuming the archiver knew what it was doing.

I think both are fine defaults and I'm not even sure we need to make
them user configurable on the reading side. We may add an option on
the writing side if there is some rare archiver that chokes on an
unknown bit in the general purpose bit area.

InfoZip Unicode Extra Fields
----------------------------

The InfoZIP folks have defined new ZIP extra fields that store UTF-8
versions of file names and comments in the entry's metadata - no matter
what the encoding of the normal name and comment fields may be.

PKWARE and WinZIP recognize these extra fields, 7ZIP and Windows'
"compressed folders" ignore them.
WinZIP creates archives using them (but we won't benefit from that
unless we fix <https://issues.apache.org/bugzilla/show_bug.cgi?id=46637>).

For maximum interop it may be a good idea to write the extra fields,
but it will make the archives bigger. That's why the current
ZipOutputStream doesn't write them by default - but it can be told to
do so.

ZipFile currently ignores the extra fields by default but can be told
to look for them. It will ignore them if the language encoding flag
has been set. It may be a good idea to look for the extra fields by
default since it really doesn't cost too much.

Defaults?
---------

I want to add new flags to <zip> and <unzip> (and thus the subclasses).

<zip>:

  * setLanguageEncodingFlag - doesn't do anything if the encoding is
    not UTF-8.

    Controls whether ZipOutputStream sets the flag. I'd make that
    default to true.

  * createUnicodeExtraFields

    Controls whether ZipOutputStream writes Unicode extra fields. I'd
    make that default to false.

<unzip>:

  * parseUnicodeExtraFields

    Controls whether ZipFile searches for Unicode extra fields. I'm
    uncertain as to what the default should be.

Stefan
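
To make the language encoding flag behaviour described above a bit more
concrete, here is a rough sketch. It is not the code in our zip package:
the class and method names are made up for illustration, and the only
outside fact it relies on is that the flag is bit 11 (0x0800) of the
general purpose bit field as documented in PKWARE's APPNOTE. It shows
the reading rule (a set flag wins over any explicitly specified
encoding) and the writing rule (the flag is only set when the option is
enabled and the encoding is UTF-8).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Ant or commons-compress. */
final class ZipNameCharsetSketch {

    /** Bit 11 of the general purpose bit field: "name and comment are UTF-8". */
    static final int LANGUAGE_ENCODING_FLAG = 1 << 11;

    /** True if the entry's general purpose bits carry the UTF-8 flag. */
    static boolean usesUtf8(int generalPurposeBits) {
        return (generalPurposeBits & LANGUAGE_ENCODING_FLAG) != 0;
    }

    /** Reading side: a set flag overrides the explicitly configured encoding. */
    static Charset charsetForReading(int generalPurposeBits, Charset configured) {
        return usesUtf8(generalPurposeBits) ? StandardCharsets.UTF_8 : configured;
    }

    /** Writing side: set the flag only if enabled and the encoding is UTF-8. */
    static int flagsForWriting(int generalPurposeBits, Charset encoding,
                               boolean setLanguageEncodingFlag) {
        if (setLanguageEncodingFlag && StandardCharsets.UTF_8.equals(encoding)) {
            return generalPurposeBits | LANGUAGE_ENCODING_FLAG;
        }
        return generalPurposeBits;
    }

    /** Decodes the raw name bytes of an entry using the rules above. */
    static String decodeName(byte[] rawName, int generalPurposeBits,
                             Charset configured) {
        return new String(rawName, charsetForReading(generalPurposeBits, configured));
    }

    public static void main(String[] args) {
        // A file name with non-ASCII characters, stored as UTF-8 bytes.
        byte[] raw = "gr\u00fc\u00dfe.txt".getBytes(StandardCharsets.UTF_8);

        // The writer is configured for UTF-8 with the option enabled.
        int bits = flagsForWriting(0, StandardCharsets.UTF_8, true);

        // A reader that honours the flag ignores its configured encoding.
        System.out.println(decodeName(raw, bits, StandardCharsets.ISO_8859_1));

        // Without the flag the configured encoding is used and the name is garbled.
        System.out.println(decodeName(raw, 0, StandardCharsets.ISO_8859_1));
    }
}
```

The second call in main() is the situation for tools that ignore the
flag: they fall back to whatever encoding they assume, which is why
they are no better off than before.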