
[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-19 Thread John Goerzen



New submission from John Goerzen :

The zipfile.py standard library component contains a number of pieces of 
questionable handling of non-UTF8 filenames.  As the ZIP file format predated 
Unicode by a significant number of years, this is actually fairly common with 
older code.

Here is a very simple reproduction case. 

mkdir t
cd t
echo hi > `printf 'test\xf7.txt'`
cd ..
zip -9r t.zip t

0xf7 is the division sign in ISO-8859-1.  In the "t" directory, "ls | hd" 
displays:

00000000  74 65 73 74 f7 2e 74 78  74 0a                    |test..txt.|
0000000a


Now, here's a simple Python3 program:

import zipfile

z = zipfile.ZipFile("t.zip")
z.extractall()

If you run this on the relevant ZIP file, the 0xf7 byte is replaced with a 
three-byte UTF-8 sequence; "ls | hd" now displays:

00000000  74 65 73 74 e2 89 88 2e  74 78 74 0a              |test....txt.|
0000000c
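
The byte change shown above is consistent with the name being decoded as cp437 
and then written back out as UTF-8. The following is only a minimal sketch 
using Python's codecs directly (it does not call zipfile), and it assumes the 
filesystem encoding is UTF-8:

```python
raw = b"test\xf7.txt"            # name bytes as stored in the ZIP

# In cp437, byte 0xf7 is U+2248 (ALMOST EQUAL TO), not the division sign.
as_cp437 = raw.decode("cp437")

# Writing that str to a UTF-8 filesystem yields three bytes for U+2248.
on_disk = as_cp437.encode("utf-8")

print(as_cp437)                  # test≈.txt
print(on_disk)                   # b'test\xe2\x89\x88.txt'
```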

The impact within Python programs is equally bad.  Fundamentally, the zipfile 
interface is broken; it should not try to decode filenames into strings and 
should instead treat them as bytes and leave potential decoding up to 
applications.  It appears to try, down various code paths, to decode filenames 
as ascii, cp437, or utf-8.  However, the ZIP file format was often used on Unix 
systems as well, which didn't tend to use cp437 (iso-8859-* was more common).  
In short, there is no way that zipfile.py can reliably guess the encoding of a 
filename in a ZIP file, so it is a data-loss bug that it attempts and fails to 
do so.  It is a further bug that extractall mangles filenames; unzip(1) is 
perfectly capable of extracting these files correctly.  I'm attaching this zip 
file for reference.

At the very least, zipfile should provide a bytes interface for filenames for 
people that care about correctness.
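
Until such an interface exists, the stored bytes can usually be recovered by 
reversing the decode. This is a sketch, not a zipfile API: raw_member_name and 
make_demo_zip are hypothetical helpers, and the approach assumes zipfile 
decoded the name as cp437 whenever the UTF-8 flag (bit 11, 0x800) is unset. 
zipfile may also normalize path separators on some platforms, so treat the 
result as best effort.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def raw_member_name(info):
    """Best-effort recovery of the name bytes stored in the archive.

    Hypothetical helper. Assumes the name was decoded as cp437 when the
    UTF-8 flag is unset, so re-encoding with cp437 reverses the decode.
    """
    encoding = "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"
    return info.filename.encode(encoding)


def make_demo_zip():
    """Build an in-memory archive whose member name holds the raw byte 0xf7."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("t/test_.txt", "hi\n")
    # Patch the same-length placeholder in the local header and the
    # central directory; the CRC covers only the file data.
    return buf.getvalue().replace(b"test_.txt", b"test\xf7.txt")


with zipfile.ZipFile(io.BytesIO(make_demo_zip())) as z:
    info = z.infolist()[0]
    decoded_name = info.filename        # what zipfile hands to callers
    recovered = raw_member_name(info)   # what was actually in the archive
```

An application that knows the real encoding can then call 
recovered.decode('iso-8859-1') itself, or use the bytes directly as a path.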

--
files: t.zip
messages: 357023
nosy: jgoerzen
priority: normal
severity: normal
status: open
title: zipfile: Corrupts filenames containing non-UTF8 characters
type: behavior
Added file: https://bugs.python.org/file48724/t.zip

___
Python tracker 
<https://bugs.python.org/issue38861>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue38864] dbm: Can't open database with bytes-encoded filename

2019-11-20 Thread John Goerzen



New submission from John Goerzen :

This simple recipe fails:

>>> import dbm
>>> dbm.open(b"foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/dbm/__init__.py", line 78, in open
    result = whichdb(file) if 'n' not in flag else None
  File "/usr/lib/python3.7/dbm/__init__.py", line 112, in whichdb
    f = io.open(filename + ".pag", "rb")
TypeError: can't concat str to bytes
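
The failure comes down to appending a str suffix to a bytes path. As a rough 
sketch of how path-handling code could accept both types (add_suffix is a 
hypothetical helper, not something dbm provides):

```python
import os


def add_suffix(path, suffix):
    """Append a str suffix to a str, bytes, or os.PathLike path.

    Hypothetical helper: the suffix is converted to bytes only when the
    path itself is bytes, so the result has the same type as the path.
    """
    path = os.fspath(path)
    if isinstance(path, bytes):
        return path + os.fsencode(suffix)
    return path + suffix


# The plain concatenation from the traceback fails for bytes paths.
try:
    b"foo" + ".pag"
    concat_failed = False
except TypeError:
    concat_failed = True

bytes_result = add_suffix(b"foo", ".pag")
str_result = add_suffix("foo", ".pag")
```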

Why does this matter?  On POSIX, a filename is any string of bytes that does 
not contain 0x00 or '/'.  A database with a filename containing, for instance, 
German characters in ISO-8859-1, can't be opened by dbm, EVEN WITH decoding.

For instance:

>>> file = b"test\xf7"
>>> dbm.open(file.decode())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf7 in position 4: invalid start byte
>>> db = dbm.open(file.decode('iso-8859-1'), 'c')
>>> db.close()

Then:

ls *.db | hd
00000000  74 65 73 74 c3 b7 2e 64  62 0a                    |test...db.|
0000000a

Note that it didn't insert the 0xf7 here; rather, it inserted the UTF-8 
sequence (c3 b7) for the division sign (which is what 0xf7 is in 
iso-8859-1).  It is not possible to open a file named "test\xf7.db" with 
the dbm module.
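
The mismatch can be reproduced with the codecs alone. This is a minimal 
sketch that does not touch dbm or the filesystem, and it assumes the 
filesystem encoding is UTF-8:

```python
raw = b"test\xf7"                       # the filename bytes we want on disk

# Decoding as UTF-8 fails outright: 0xf7 is not a valid start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as ISO-8859-1 succeeds and gives U+00F7 (division sign) ...
name = raw.decode("iso-8859-1")

# ... but a UTF-8 filesystem encoding writes that character as two bytes.
written = name.encode("utf-8")

print(written)                          # b'test\xc3\xb7'
```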

--
messages: 357078
nosy: jgoerzen
priority: normal
severity: normal
status: open
title: dbm: Can't open database with bytes-encoded filename

___
Python tracker 
<https://bugs.python.org/issue38864>
___

[issue38864] dbm: Can't open database with bytes-encoded filename

2019-11-20 Thread John Goerzen



John Goerzen  added the comment:

As has been pointed out to me, the surrogateescape method could be used here; 
however, it is a bit of an odd duckling itself, and the system's open() call 
accepts bytes; couldn't this as well?
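
For reference, a minimal sketch of the surrogateescape round trip. It shows 
only the codec behaviour; whether every dbm backend accepts the resulting str 
is not verified here. On POSIX, os.fsdecode() and os.fsencode() apply this 
error handler with the filesystem encoding.

```python
raw = b"test\xf7"

# Undecodable bytes become lone surrogates (0xf7 -> U+DCF7) instead of
# raising UnicodeDecodeError.
as_str = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
back = as_str.encode("utf-8", "surrogateescape")

print(ascii(as_str))   # 'test\udcf7'
print(back)            # b'test\xf7'
```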

--


[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-24 Thread John Goerzen



John Goerzen  added the comment:

I can tell you that the zip(1) on Unix systems has never done re-encoding to 
cp437; on a system that uses latin-1 (or any other latin-* for that matter) the 
filenames in the ZIP will be encoded in latin-1.  Furthermore, this doesn't 
explain the corruption that extractall() causes.

--


[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-25 Thread John Goerzen



John Goerzen  added the comment:

Hi Jon,

I've read your article in the gist, the ZIP spec, and the article you linked 
to.  As the article you linked to 
(https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/) states, 
"Implementers just encode file names however they want (usually byte for byte 
as they are in the OS".  That is certainly my observation.  CP437 has NEVER 
been guaranteed, *even on DOS*.  See 
https://en.wikipedia.org/wiki/Category:DOS_code_pages and 
https://www.aivosto.com/articles/charsets-codepages-dos.html for details on DOS 
code pages.  I do not recall any translation between DOS codepages being done 
in practice, or even being possible, since the whole point of multiple 
codepages was the need for more than 256 symbols.  So (leaving aside utf-8 
encodings for a second) no operating system or ZIP implementation I am aware 
of performs a translation to cp437, and such translation is often not even 
possible.  They just copy literal bytes into the ZIP, the same way the POSIX 
filesystem stores them.

So, from the above paragraph, it's clear that the assumption in zipfile that 
cp437 is in use is faulty.  Your claim that Python "fixes" a problem is also 
faulty.  Converting from a latin-1 character, using a cp437 codeset, and 
generating a filename with that cp437 character represented as a Unicode code 
point is wrong in many ways.  Python should not take an opinion on this; it 
should be agnostic and copy the bytes that represent the filename in the ZIP to 
bytes that represent the filename on the filesystem.

POSIX filenames may contain any of 254 byte values (only 0x00 and '/' are 
invalid).  The filesystem is encoding-agnostic; POSIX filenames are just 
streams of bytes.  
There is no alternative but to treat ZIP filenames (without the Unicode flag) 
the same way.  Copy bytes to bytes.  It is not possible to identify the 
encoding of the filename in the absence of the Unicode flag.

zipfile should:

1) expose a bytes interface to filename
2) use byte-for-byte extraction when no Unicode flag is present
3) not make the assumption that cp437 was the original encoding

Your proposal only "works" cross-platform because it is broken on every 
platform!

--

