New submission from Alexey Izbyshev <izbys...@ispras.ru>:

If a format string contains code points outside the ASCII range, time.strftime() can behave in four different ways depending on the platform, the current locale and the code points:
* raise a UnicodeEncodeError
* return an empty string
* for surrogates in \uDC80-\uDCFF range, replace them with different code points in the output (potentially mangling nearby parts of the output as well)
* round-trip them correctly

Some examples:

* Linux (glibc 2.27):

Python 3.6.4 (default, Jan 03 2018, 13:52:55) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, locale
>>> locale.getlocale()
('en_US', 'UTF-8')
>>> time.strftime('\x80')
'\x80'
>>> time.strftime('\u044f')
'я' # '\u044f'
>>> time.strftime('\ud800')
'\ud800'
>>> time.strftime('\udcff')
'\udcff'
>>> locale.setlocale(locale.LC_CTYPE, 'C')
'C'
>>> time.strftime('\x80')
'\x80'
>>> time.strftime('\u044f')
'я' # '\u044f'
>>> time.strftime('\ud800')
'\ud800'
>>> time.strftime('\udcff')
'\udcff'

* macOS 10.13.6 and FreeBSD 11.1:

Python 3.7.0 (default, Jul 23 2018, 20:22:55)
[Clang 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, locale
>>> locale.getlocale()
('en_US', 'UTF-8')
>>> time.strftime('\x80')
'\x80'
>>> time.strftime('\u044f')
'я' # '\u044f'
>>> time.strftime('\ud800')
''
>>> time.strftime('\udcff')
''
>>> locale.setlocale(locale.LC_CTYPE, 'C')
'C'
>>> time.strftime('\x80')
'\x80'
>>> time.strftime('\u044f')
''
>>> time.strftime('\ud800')
''
>>> time.strftime('\udcff')
''

* Windows 8.1:

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
>>> import time, locale
>>> locale.getlocale()
(None, None)
>>> time.strftime('\x80')
'\x80'
>>> time.strftime('\u044f')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\u044f' in position 0: encoding error
>>> time.strftime('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\ud800' in position 0: encoding error
>>> time.strftime('\udcff')
'y' # '\xff'
>>> locale.setlocale(locale.LC_CTYPE, '')
'Russian_Russia.1251'
>>> time.strftime('\x80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\x80' in position 0: encoding error
>>> time.strftime('\u044f')
'я' # '\u044f'
>>> time.strftime('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\ud800' in position 0: encoding error
>>> time.strftime('\udcff')
'я' # '\u044f'

The reasons for these differences are as follows:

* Reliance on either wcsftime() or strftime() from the C library, depending on the platform.
* For strftime(), the input is encoded into the charset of the current locale with the 'surrogateescape' error handler, and the output is decoded back in the same way.
* Different handling by glibc and macOS/FreeBSD of code points that are unrepresentable in the charset of the current locale.
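The encode/decode step described for the strftime() path can be sketched in pure Python. This is only an illustration, not what CPython does internally: the helper names simulate_roundtrip and try_roundtrip are hypothetical, an explicit codec name stands in for the charset of the current locale, and the C library is never called, so the empty-string results seen on macOS/FreeBSD are not reproduced.

```python
def simulate_roundtrip(fmt, charset):
    """Encode fmt into charset and decode it back, both with 'surrogateescape'.

    Hypothetical helper: an explicit codec name stands in for the locale
    charset, and no C library function is involved.
    """
    raw = fmt.encode(charset, errors="surrogateescape")
    return raw.decode(charset, errors="surrogateescape")


def try_roundtrip(fmt, charset):
    """Return the round-tripped string, or the exception class name on failure."""
    try:
        return simulate_roundtrip(fmt, charset)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


if __name__ == "__main__":
    # ascii() keeps lone surrogates printable on any terminal encoding.
    for charset in ("ascii", "latin-1", "cp1251", "utf-8"):
        for fmt in ("\x80", "\u044f", "\ud800", "\udcff", "\udcd1\udc8f"):
            print(charset, ascii(fmt), "->", ascii(try_roundtrip(fmt, charset)))
```

With "cp1251", '\udcff' becomes the byte 0xFF and comes back as '\u044f', which is consistent with the Russian_Russia.1251 session above. With "utf-8", the pair '\udcd1\udc8f' becomes the bytes D1 8F and also decodes to '\u044f', showing how escaped surrogates can merge into a different code point.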
I suggest at least documenting that the format string, despite being a 'str', requires special care if it contains non-ASCII code points.

The 'datetime' module docs warn about the locale-dependent output, but only with regard to particular format specifiers [1].

I'll submit a draft PR. Suggestions are welcome.

[1] https://docs.python.org/3.7/library/datetime.html#strftime-and-strptime-behavior

----------
assignee: docs@python
components: Documentation
messages: 324136
nosy: belopolsky, docs@python, izbyshev, p-ganssle, taleinat
priority: normal
severity: normal
status: open
title: Document platform-specific strftime() behavior for non-ASCII format strings
type: enhancement
versions: Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34512>
_______________________________________