Lars Gustäbel <l...@gustaebel.de> added the comment:

I think it is a good suggestion to use "surrogateescape" as the default, 
because (I hope) it produces the fewest errors and is the best choice if 
tarfile is used in connection with Python's filesystem calls.

- When reading tar headers, undecodable chars in filenames end up as 
surrogates. This way no information is lost. In principle tarfile is merely a 
gateway to a filesystem inside an archive, so it feels natural if it treats 
filenames the same as Python's filesystem calls.

- When writing tar headers, filenames with surrogate chars (e.g. from 
os.listdir()) will be converted back to bytes in the header (in case of gnu and 
ustar formats). Filenames will remain unchanged, which is exactly what one 
would expect.

- When writing pax headers, filenames with surrogates will raise a UnicodeError 
because we may only use strict utf-8 inside a pax header. This is actually no 
different from the status quo.
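
The three cases above can be sketched roughly as follows. This is only an illustration, not part of the patch: the filename is made up, and "utf-8" is assumed as the archive encoding. The pax case is shown at the codec level only (strict utf-8 rejecting a surrogate), since how tarfile itself reacts may depend on the Python version.

```python
import io
import tarfile

# Hypothetical filename: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# Reading: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9,
# so no information is lost.
name = raw.decode("utf-8", "surrogateescape")

# Writing (gnu/ustar): the surrogate is turned back into the original byte.
restored = name.encode("utf-8", "surrogateescape")

# Strict utf-8, as required inside a pax header, rejects the surrogate.
try:
    name.encode("utf-8", "strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Round trip through an in-memory gnu-format archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                  encoding="utf-8", errors="surrogateescape") as tar:
    tar.addfile(tarfile.TarInfo(name))  # empty member, header only

# The name field sits at the very start of the first header block.
header_name = buf.getvalue()[:len(raw)]

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r", encoding="utf-8",
                  errors="surrogateescape") as tar:
    names = tar.getnames()
```

With these assumptions, the bytes stored in the header equal the original bytes, and reading the archive back yields the same surrogate-escaped string that was written.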

@Martin: As I understand it, the pax "invalid"-option is supposed to deal with 
the case when strings from a pax header are not representable in the user's 
encoding. In tarfile's case we don't have this problem when reading the archive 
until we try to extract it.

Unfortunately, POSIX says nothing about how to store bad filenames in a pax 
archive. tarfile raises an error. GNU tar fails silently: it just puts the 
unchanged original filename into the pax header without converting it to utf-8, 
thus violating the standard.

----------
Added file: http://bugs.python.org/file17227/tarfile_surrogates.2.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8390>
_______________________________________