2to3, str, and basestring

Terry Reedy Sat, 07 Sep 2019 12:34:42 -0700

2to3 converts syntactically valid 2.x code to syntactically valid 3.x code. It cannot, however, guarantee semantic correctness. A particular problem is that str is semantically ambiguous in 2.x, as it is used both for text encoded as bytes and for binary data.

To resolve the ambiguity for conversions to 3.x, 2.6 introduced 'bytes' as a synonym for 'str'. The intention is that one use 'bytes' to create or refer to 2.x bytes that should remain bytes in 3.x and use 'str' to create or refer to 2.x text bytes that should become or will be unicode in 3.x. 3.x and hence 2to3 *assume* that one is using 'bytes' and 'str' this way, so that 'unicode' becomes an unneeded synonym for 'str' and 2to3 changes 'unicode' to 'str'. If one does not use 'str' and 'bytes' as intended, 2to3 may produce semantically different code.

2.3 introduced abstract superclass 'basestring', which can be viewed as Union(unicode, str). "isinstance(value, basestring)" is equivalent to "isinstance(value, (unicode, str))". I believe the intended meaning was 'text, whether unicode or encoded bytes'. Certainly, any code following

  if isinstance(value, basestring):

would likely only make sense if that were true.

In any case, after 2.6, one should only use 'basestring' when the 'str' part has its restricted meaning of 'unicode in 3.x'. "(unicode, bytes)" is semantically different from "basestring" and "(unicode, str)" when used in isinstance. 2to3 converts them to "(str, bytes)", 'str', and '(str, str)' (the same as 'str' when used in isinstance), respectively. If one uses 'basestring' when one means '(unicode, bytes)', 2to3 may produce semantically different code.
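As a quick check of the three converted forms, here is a small 3.x sketch that applies each isinstance test to a text value and a bytes value. The sample values and the 'forms' and 'results' names are made up for illustration.

```python
# Python 3: the three isinstance targets that 2to3 emits.
text = "abc"                    # 3.x str (unicode text)
data = "abc".encode("utf-8")    # 3.x bytes

forms = {
    "(str, bytes)": (str, bytes),   # from 2.x "(unicode, bytes)"
    "str": str,                     # from 2.x "basestring"
    "(str, str)": (str, str),       # from 2.x "(unicode, str)"
}

# For each form, record (accepts text?, accepts bytes?).
results = {
    name: (isinstance(text, form), isinstance(data, form))
    for name, form in forms.items()
}

for name, (text_ok, data_ok) in results.items():
    print(name, "text:", text_ok, "bytes:", data_ok)
```

Only the first form accepts bytes in 3.x, which is why the choice between 'basestring' and '(unicode, bytes)' in the 2.x source matters.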


Example based on https://bugs.python.org/issue38003:

if isinstance(value, basestring):
    if not isinstance(value, unicode):
        value = value.decode(encoding)
    process_text(value)
else:
    process_nontext(value)

2to3 produces

if isinstance(value, str):
    if not isinstance(value, str):
        value = value.decode(encoding)
    process_text(value)
else:
    process_nontext(value)
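To see what this converted code does in 3.x, here is a runnable sketch. process_text and process_nontext are hypothetical stand-ins that only report which branch was taken, and the function wrapper, the return statements, and the utf-8 default are added so the result can be inspected.

```python
# Python 3 sketch of the converted code's behavior.
# process_text / process_nontext are hypothetical stubs, not from the issue.
def process_text(value):
    return ("text", value)

def process_nontext(value):
    return ("nontext", value)

def handle_converted(value, encoding="utf-8"):
    if isinstance(value, str):
        if not isinstance(value, str):      # can never be true here
            value = value.decode(encoding)
        return process_text(value)
    else:
        return process_nontext(value)

print(handle_converted("abc"))                   # text goes to process_text
print(handle_converted("abc".encode("utf-8")))   # bytes go to process_nontext, undecoded
```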

If, in 3.x, value is always unicode, then the inner conditional is dead and can be removed. But if, in 3.x, value might be byte-encoded text, it will not be decoded and the code is wrong. Fixes:

1. Instead of decoding value after the check, do it before the check. I think this is best for new code.


if isinstance(value, bytes):
    value = value.decode(encoding)
...
if isinstance(value, unicode):
    process_text(value)
else:
    process_nontext(value)

2. Replace 'basestring' with '(unicode, bytes)'. This is easier with existing code.


if isinstance(value, (unicode, bytes)):
    if not isinstance(value, unicode):
        value = value.decode(encoding)
    process_text(value)
else:
    process_nontext(value)

I believe, but have not tested, that 2to3 produces correct 3.x code from either 1 or 2. In both cases, the only change it needs to make is replacing 'unicode' with 'str'.

3. Edit Lib/lib2to3/fixes/fix_basestring.py to replace 'basestring' with '(str, bytes)' instead of 'str'. This should be straightforward if one understands the ast format.
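For fixes 1 and 2, the expected 3.x result can be sketched by hand as follows. This was written manually by replacing 'unicode' with 'str', not produced by running 2to3. process_text and process_nontext are again hypothetical stand-ins, and the function wrappers, return statements, and utf-8 default are added for illustration.

```python
# Python 3 sketch of fixes 1 and 2 after 'unicode' -> 'str'.
# process_text / process_nontext are hypothetical stubs.
def process_text(value):
    return ("text", value)

def process_nontext(value):
    return ("nontext", value)

def handle_fix1(value, encoding="utf-8"):
    # Fix 1: decode bytes first, then do the text check.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    if isinstance(value, str):
        return process_text(value)
    else:
        return process_nontext(value)

def handle_fix2(value, encoding="utf-8"):
    # Fix 2: '(unicode, bytes)' becomes '(str, bytes)', so bytes
    # still reach the inner check and get decoded.
    if isinstance(value, (str, bytes)):
        if not isinstance(value, str):
            value = value.decode(encoding)
        return process_text(value)
    else:
        return process_nontext(value)

for handler in (handle_fix1, handle_fix2):
    print(handler.__name__, handler("abc"), handler(b"abc"), handler(42))
```

Both versions decode byte-encoded text before calling process_text and send anything else to process_nontext.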

Note that 2to3 is not meant for 2&3 code using exception tricks and six/future imports. Turning 2&3 code into idiomatic 3-only code is a separate subject.


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list
