> Should the note be removed, or should it say something like "Unicode
> file names are supported. New in Python 2.6."? Is there anything else
> that should be mentioned?

The note should be corrected, documenting the behaviour implemented.

> More on cp437: I see where you mentioned to the patch author that a
> unicode string should be encoded in cp437 if possible, but this was
> not done -- it first tries ascii. What are your views on what encoding
> should be assumed if the utf8 flag is not set?

There isn't any standard that is widely followed (just as the note that
you declared bafflegab says). While APPNOTE.TXT specifies the encoding
as cp437, implementations often ignore that, because a) their authors
didn't know, and b) cp437 was too limited for what they wanted to do.
So we see all kinds of alternative implementations - often involving the
locale's code page (and on Windows, both OEMCP and ACP get used - often
just as a side effect of whatever internal representation the
applications use).

In 2.x, Python doesn't need to decide, so when opening a zip file, the
file names get reported as byte strings unless they have the UTF-8 bit
set (in which case they get decoded). In 3.x, file names (in the zipfile
module) uniformly use the (unicode) character string type, hence that
version implements the spec, by decoding as cp437.

Upon encoding, choosing between ASCII and CP437 has trade-offs. Notice
how both formally comply with the spec, as ASCII is a subset of CP437
(i.e. even though it uses the ASCII codec, it *still* encodes as CP437).

The tradeoffs can be studied by looking at three groups of file names:
- pure ASCII; choice does not matter (both ascii and cp437 can encode
  the file name, and both get the same result)
- arbitrary string containing non-CP437 characters; choice does not
  matter (neither ascii nor cp437 can encode, so the UTF-8 bit must be
  used)
- others; here are the tradeoffs.
  Pro ASCII: receiver can unambiguously reproduce the original file
  name, as the UTF-8 bit will be set.
  Pro CP437: old software (unaware of the UTF-8 bit) has a chance of
  correctly guessing the file name (if it followed APPNOTE.TXT).
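
To make the three groups concrete, here is a minimal sketch of the
ASCII-first choice and of the cp437 fallback on reading. The names
encode_name, decode_name and name_group are made up for illustration and
are not the zipfile module's actual code; the only thing taken from
APPNOTE.TXT is that bit 11 (0x800) of the general purpose flags marks a
UTF-8 file name.

```python
UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag (APPNOTE.TXT)


def encode_name(name):
    """Hypothetical helper: return (raw_bytes, flag_bits), trying ASCII first."""
    try:
        return name.encode("ascii"), 0
    except UnicodeEncodeError:
        # Anything beyond ASCII is stored as UTF-8 with the flag set.
        return name.encode("utf-8"), UTF8_FLAG


def decode_name(raw, flag_bits):
    """Hypothetical helper: decode as UTF-8 if flagged, else as cp437."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode("cp437")


def name_group(name):
    """Hypothetical helper: say which of the three groups a name falls into."""
    for codec, label in (("ascii", "ascii"), ("cp437", "cp437-only")):
        try:
            name.encode(codec)
            return label
        except UnicodeEncodeError:
            pass
    return "needs-utf8"
```

Only names in the "cp437-only" group are affected by the choice: with
ASCII first they are written as UTF-8 with the flag set, so a reader
that honours the flag gets the original name back, while a reader
unaware of the flag sees the raw UTF-8 bytes.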

I (now) prefer the tradeoff being taken, as it's the one that produces
more reliable results in the long run (i.e. when more and more zip
readers support UTF-8).

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list