[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters
New submission from John Goerzen:

The zipfile.py standard library component contains a number of pieces of questionable handling of non-UTF8 filenames. As the ZIP file format predated Unicode by a significant number of years, this is actually fairly common with older code.

Here is a very simple reproduction case.

    mkdir t
    cd t
    echo hi > `printf 'test\xf7.txt'`
    cd ..
    zip -9r t.zip t

0xf7 is the division sign in ISO-8859-1. In the "t" directory, "ls | hd" displays:

    74 65 73 74 f7 2e 74 78 74 0a  |test..txt.|
    000a

Now, here's a simple Python3 program:

    import zipfile
    z = zipfile.ZipFile("t.zip")
    z.extractall()

If you run this on the relevant ZIP file, the 0xf7 character is replaced with a Unicode sequence; "ls | hd" now displays:

    74 65 73 74 e2 89 88 2e 74 78 74 0a  |testtxt.|
    000c

The impact within Python programs is equally bad. Fundamentally, the zipfile interface is broken; it should not try to decode filenames into strings and should instead treat them as bytes and leave potential decoding up to applications. It appears to try, down various code paths, to decode filenames as ascii, cp437, or utf-8. However, the ZIP file format was often used on Unix systems as well, which didn't tend to use cp437 (iso-8859-* was more common). In short, there is no way that zipfile.py can reliably guess the encoding of a filename in a ZIP file, so it is a data-loss bug that it attempts and fails to do so.

It is a further bug that extractall mangles filenames; unzip(1) is perfectly capable of extracting these files correctly. I'm attaching this zip file for reference.

At the very least, zipfile should provide a bytes interface for filenames for people that care about correctness.
--
files: t.zip
messages: 357023
nosy: jgoerzen
priority: normal
severity: normal
status: open
title: zipfile: Corrupts filenames containing non-UTF8 characters
type: behavior
Added file: https://bugs.python.org/file48724/t.zip

___
Python tracker <https://bugs.python.org/issue38861>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
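The byte sequences in the report can be reproduced without a ZIP file at all. The following is a minimal sketch, assuming that zipfile decodes names lacking the UTF-8 flag as cp437 and that the filesystem encoding is UTF-8 (typical on current Linux systems). The helper names mangle_like_zipfile and undo_mangling are hypothetical and are not part of zipfile.

```python
def mangle_like_zipfile(raw: bytes) -> bytes:
    """Mimic the round trip described in the report (an assumption, not zipfile code).

    Step 1: decode the stored name as cp437, as zipfile appears to do when
            the UTF-8 flag is absent.
    Step 2: encode the resulting str as UTF-8, as a UTF-8 filesystem
            encoding would when the file is created.
    """
    return raw.decode("cp437").encode("utf-8")


def undo_mangling(mangled: bytes) -> bytes:
    """Reverse the round trip. cp437 maps every byte value, so this is lossless."""
    return mangled.decode("utf-8").encode("cp437")


original = b"test\xf7.txt"
mangled = mangle_like_zipfile(original)

print(original.hex(" "))  # 74 65 73 74 f7 2e 74 78 74
print(mangled.hex(" "))   # 74 65 73 74 e2 89 88 2e 74 78 74
print(undo_mangling(mangled) == original)  # True
```

Byte 0xf7 is U+2248 (ALMOST EQUAL TO) in cp437, whose UTF-8 encoding is e2 89 88; this matches the second hex dump in the report.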
[issue38864] dbm: Can't open database with bytes-encoded filename
New submission from John Goerzen:

This simple recipe fails:

    >>> import dbm
    >>> dbm.open(b"foo")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.7/dbm/__init__.py", line 78, in open
        result = whichdb(file) if 'n' not in flag else None
      File "/usr/lib/python3.7/dbm/__init__.py", line 112, in whichdb
        f = io.open(filename + ".pag", "rb")
    TypeError: can't concat str to bytes

Why does this matter? On POSIX, a filename is any string of bytes that does not contain 0x00 or '/'. A database with a filename containing, for instance, German characters in ISO-8859-1, can't be opened by dbm, EVEN WITH decoding. For instance:

    >>> file = b"test\xf7"
    >>> dbm.open(file.decode())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf7 in position 4: invalid start byte
    >>> db = dbm.open(file.decode('iso-8859-1'), 'c')
    >>> db.close()

Then:

    ls *.db | hd
    74 65 73 74 c3 b7 2e 64 62 0a  |test...db.|
    000a

Note that it didn't insert the byte 0xf7 here; rather, it inserted the UTF-8 sequence (c3 b7) for the division sign, which is what 0xf7 is in iso-8859-1. It is not possible to open a file named "test\xf7.db" with the dbm module.

--
messages: 357078
nosy: jgoerzen
priority: normal
severity: normal
status: open
title: dbm: Can't open database with bytes-encoded filename

___
Python tracker <https://bugs.python.org/issue38864>
___
[issue38864] dbm: Can't open database with bytes-encoded filename
John Goerzen added the comment:

As has been pointed out to me, the surrogateescape method could be used here; however, it is a bit of an odd duckling itself, and the system's open() call accepts bytes; couldn't dbm.open() accept them as well?
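The surrogateescape approach mentioned above can be sketched as follows. This is a sketch under stated assumptions, not a fix for dbm.open(): it assumes a POSIX system whose filesystem accepts arbitrary bytes in names (for example Linux), and it uses the pure-Python dbm.dumb backend directly because how the C backends handle such names is not established in this thread. The helper name open_dumb_db is hypothetical.

```python
import dbm.dumb
import os


def open_dumb_db(raw_path: bytes, flag: str = "c"):
    """Open a dbm.dumb database whose base filename is given as bytes.

    On POSIX, os.fsdecode() uses the surrogateescape error handler, so a
    byte that cannot be decoded (such as 0xf7 under UTF-8) is carried in
    the str as a lone surrogate and turned back into the same byte when
    the file is actually opened.
    """
    return dbm.dumb.open(os.fsdecode(raw_path), flag)


# The str form round-trips to the original bytes on POSIX.
assert os.fsencode(os.fsdecode(b"test\xf7")) == b"test\xf7"
```

With this, open_dumb_db(b"test\xf7") would be expected to create files whose names start with the raw byte sequence 74 65 73 74 f7, rather than with the UTF-8 sequence c3 b7 seen in the original report.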
[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters
John Goerzen added the comment: I can tell you that the zip(1) on Unix systems has never done re-encoding to cp437; on a system that uses latin-1 (or any other latin-* for that matter) the filenames in the ZIP will be encoded in latin-1. Furthermore, this doesn't explain the corruption that extractall() causes.
[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters
John Goerzen added the comment:

Hi Jon,

I've read your article in the gist, the ZIP spec, and the article you linked to. As the article you linked to (https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/) states, "Implementers just encode file names however they want (usually byte for byte as they are in the OS". That is certainly my observation.

CP437 has NEVER been guaranteed, *even on DOS*. See https://en.wikipedia.org/wiki/Category:DOS_code_pages and https://www.aivosto.com/articles/charsets-codepages-dos.html for details on DOS code pages. I do not recall any translation between DOS codepages being done in practice, or even being possible, since the whole point of multiple codepages was the need for more than 256 symbols. So (leaving aside utf-8 encodings for a second) no operating system or ZIP implementation I am aware of performs a translation to cp437, such translation is often not even possible, and they're just copying literal bytes into the ZIP, just as the POSIX filesystem itself stores literal bytes.

So, from the above paragraph, it's clear that the assumption in zipfile that cp437 is in use is faulty. Your claim that Python "fixes" a problem is also faulty. Taking a latin-1 byte, decoding it with the cp437 codec, and generating a filename with the resulting cp437 character represented as a Unicode code point is wrong in many ways. Python should not take an opinion on this; it should be agnostic and copy the bytes that represent the filename in the ZIP to bytes that represent the filename on the filesystem.

POSIX filenames may contain any of 254 byte values (only 0x00 and '/' are invalid). The filesystem is encoding-agnostic; POSIX filenames are just streams of bytes. There is no alternative but to treat ZIP filenames (without the Unicode flag) the same way. Copy bytes to bytes. It is not possible to identify the encoding of the filename in the absence of the Unicode flag.
zipfile should:

1) expose a bytes interface to filename
2) use byte-for-byte extraction when no Unicode flag is present
3) not make the assumption that cp437 was the original encoding

Your proposal only "works" cross-platform because it is broken on every platform!
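Until zipfile offers such an interface, points 1 and 2 can be approximated from outside the module. The following is a rough sketch, assuming that zipfile decoded names without the UTF-8 flag (general purpose bit 11, 0x800) as cp437, so that re-encoding as cp437 recovers the stored bytes, and assuming a POSIX filesystem that accepts arbitrary bytes in names. The names raw_name and extract_bytewise are hypothetical helpers, not zipfile APIs.

```python
import os
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the stored name is UTF-8


def raw_name(info: zipfile.ZipInfo) -> bytes:
    """Recover the filename bytes as stored in the archive."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    # Assumption: zipfile decoded the name as cp437. That codec maps all
    # 256 byte values, so encoding back gives the original bytes.
    return info.filename.encode("cp437")


def extract_bytewise(archive, dest: bytes) -> list:
    """Extract all members under `dest`, writing their names byte for byte."""
    written = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Drop empty and "." components; this also strips a leading "/".
            parts = [p for p in raw_name(info).split(b"/") if p not in (b"", b".")]
            if not parts or b".." in parts:
                continue  # skip empty names and path traversal attempts
            target = os.path.join(dest, *parts)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as out:
                out.write(src.read())
            written.append(target)
    return written
```

For the attached t.zip, extract_bytewise("t.zip", b"out") would be expected to produce out/t/test\xf7.txt with the 0xf7 byte intact. The sketch ignores permissions, timestamps, and symlinks, which a complete extractor would need to handle.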