Re: bug/deficiency in zip: non-ascii chars in file names work, but fail in directory names

Brent Sun, 02 Nov 2014 05:36:31 -0800

Doug Henderson wrote:
    "You need to add the -r option to recurse into directories:"



You are 100% correct; my oversight.


Actually, it was a copy and paste error: the real code that I want to test does 
use -r, but when I tried to adapt that code to a simpler format for my email, I 
accidentally dropped the -r.


The code that I really want to test fails with a different error, so you solved 
a mystery that was really bugging me: why the console code in my email behaved 
differently from the test code I really care about.



I returned to analysing my real test code more carefully, and I still see a 
problem with cygwin's unzip: it fails to extract zip files with unicode names 
that are produced by OTHER programs (i.e. some other program besides cygwin 
zip).


In particular, one part of my test code creates a zip archive using Java 
(ZipOutputStream and ZipEntry), and then confirms that the archive can be 
extracted and exactly reproduced by multiple other means.

The first extraction method is to again use Java (ZipFile and ZipEntry); this 
works perfectly, as it should.
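
For reference, a minimal sketch of that Java round trip. This is not the actual test code: the class name, method names, and zero-filled 2048-byte payload are placeholders, and it assumes the archive is written and read with an explicit UTF-8 charset (Java 7 or later). Only the directory and file names are taken from this message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class UnicodeZipRoundTrip {
    // Directory and file names from the report; everything else is illustrative.
    static final String DIR = "\u00E5\u00D8\u00E2\u00E9\u00F1/";
    static final String FILE =
            DIR + "\u3400\u4E01\u9FA6\uF900\uFA30_file#2_length2048.txt";

    // Write a directory entry and one file entry, encoding names as UTF-8.
    static void write(Path zip) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(
                Files.newOutputStream(zip), StandardCharsets.UTF_8)) {
            out.putNextEntry(new ZipEntry(DIR));
            out.closeEntry();
            out.putNextEntry(new ZipEntry(FILE));
            out.write(new byte[2048]); // placeholder payload
            out.closeEntry();
        }
    }

    // Read the entry names back, decoding them as UTF-8.
    static List<String> readNames(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile(), StandardCharsets.UTF_8)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path zip = Files.createTempFile("unicode-names", ".zip");
        write(zip);
        List<String> names = readNames(zip);
        boolean intact = names.contains(DIR) && names.contains(FILE);
        System.out.println(intact ? "names survived" : "names were mangled");
        Files.delete(zip);
    }
}
```

The check compares strings inside the JVM rather than printing the names, so the result does not depend on the console's own encoding.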

The second extraction method is to use cygwin's unzip; this fails: IT MANGLES 
THE NAMES.  In particular:
    1) the directory should be åØâéñ (\u00E5\u00D8\u00E2\u00E9\u00F1)
    2) the file should be 㐀丁龦豈侮_file#2_length2048.txt (first 5 chars 
\u3400\u4E01\u9FA6\uF900\uFA30)
but what cygwin unzip actually produces during extraction is
    1) the directory is +++++
    2) the file is ڥǴ_file#2_length2048.txt

To rule out Java as being non-standard, I manually took the zip archive it 
produced and extracted it using the latest 7-zip (9.20), which worked perfectly 
(the directory and file names came out exact).  To further verify, I also 
temporarily installed the latest WinZip (19.0 build 11293) and once again, it 
extracted Java's zip file with non-ASCII names perfectly.  If anyone wants to 
verify these claims, I am attaching the zip file produced by Java (and 
extractable by 7zip and WinZip, but NOT by cygwin unzip) to this email.  
[UPDATE: my original email yesterday had this attachment, but I do not see it 
showing up on the mailing list.  I take it that the cygwin mailing lists 
auto-reject emails with attachments?]


So, I reckon that cygwin unzip is the odd man out.


Oh, when I try to view this zip file using Windows 7's integrated zip viewer in 
Windows Explorer, it displays mangled directory and file names that are 
something different still from what cygwin unzip produced.  This link
    
https://www.jam-software.com/treesize/online_manual/EN/unicode_zip_files.html

claims that Windows 7 does not really support unicode names, so this is perhaps 
expected.

Also, I found that this inter-program incompatibility is limited to cygwin 
unzip: cygwin zip seems to produce archives involving unicode names that other 
programs can extract just fine.



I did some web research, and the most relevant link that I could find about 
cygwin unzip and unicode is this old announcement from 2009:
    https://cygwin.com/ml/cygwin-announce/2009-08/msg00006.html

That announcement contains this ominous text:
    Currently, on Windows the UTF-8 handling is limited to the character subset
    contained in the configured non-unicode "system code page".

Is it possible that the deficiency mentioned above has simply not been fixed in 
the last 5 years?
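
One check that might help narrow this down is whether an archive actually marks its names as UTF-8. The zip format reserves bit 11 of each entry's general purpose flag (stored at byte offset 6 of the local file header) for that purpose. The sketch below only inspects the first entry and assumes the archive starts with a local file header at offset 0; the class and method names are made up for illustration, and it says nothing about how cygwin unzip itself treats the flag.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ZipUtf8Flag {
    // Returns true if the first local file header has general purpose bit 11
    // (the "language encoding" flag, meaning UTF-8 names) set.
    static boolean firstEntryFlaggedUtf8(InputStream in) throws IOException {
        byte[] header = new byte[8];
        new DataInputStream(in).readFully(header);
        // Local file header signature is "PK\003\004".
        if (header[0] != 'P' || header[1] != 'K'
                || header[2] != 3 || header[3] != 4) {
            throw new IOException("no local file header at offset 0");
        }
        // The flag field is a little-endian 16-bit value at offset 6.
        int flags = (header[6] & 0xFF) | ((header[7] & 0xFF) << 8);
        return (flags & 0x0800) != 0;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the archive to inspect.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(firstEntryFlaggedUtf8(in)
                    ? "first entry is flagged as UTF-8"
                    : "first entry is NOT flagged as UTF-8");
        }
    }
}
```

Running it against both the Java-produced archive and one produced by cygwin zip would show whether the two differ in this flag.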
