2to3 converts syntactically valid 2.x code to syntactically valid 3.x code. It cannot, however, guarantee semantic correctness. A particular problem is that str is semantically ambiguous in 2.x, as it is used both for text encoded as bytes and binary data.

To resolve the ambiguity for conversions to 3.x, 2.6 introduced 'bytes' as a synonym for 'str'. The intention is that one use 'bytes' to create or refer to 2.x bytes that should remain bytes in 3.x and use 'str' to create or refer to 2.x text bytes that should become or will be unicode in 3.x. 3.x and hence 2to3 *assume* that one is using 'bytes' and 'str' this way, so that 'unicode' becomes an unneeded synonym for 'str' and 2to3 changes 'unicode' to 'str'. If one does not use 'str' and 'bytes' as intended, 2to3 may produce semantically different code.

2.3 introduced the abstract superclass 'basestring', which can be viewed as Union(unicode, str). "isinstance(value, basestring)" is defined as "isinstance(value, (unicode, str))". I believe the intended meaning was 'text, whether unicode or encoded bytes'. Certainly, any code following
  if isinstance(value, basestring):
would likely only make sense if that were true.

In any case, after 2.6, one should only use 'basestring' when the 'str' part has its restricted meaning of 'unicode in 3.x'. "(unicode, bytes)" is semantically different from "basestring" and "(unicode, str)" when used in isinstance. 2to3 converts them to "(str, bytes)", 'str', and '(str, str)' respectively (the last is the same as 'str' when used in isinstance). If one uses 'basestring' when one means '(unicode, bytes)', 2to3 may produce semantically different code.
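
The effect of these three conversions can be checked directly in 3.x. A minimal sketch, using only builtins; the sample values are made up for illustration:

```python
# Python 3: what each converted isinstance check accepts.
text = "caf\u00e9"              # 3.x str (unicode text)
data = text.encode("utf-8")     # 3.x bytes

# 'basestring' -> 'str' and '(unicode, str)' -> '(str, str)': text only.
basestring_check = [isinstance(v, str) for v in (text, data)]
pair_check = [isinstance(v, (str, str)) for v in (text, data)]

# '(unicode, bytes)' -> '(str, bytes)': both text and bytes.
bytes_check = [isinstance(v, (str, bytes)) for v in (text, data)]

print(basestring_check, pair_check, bytes_check)
```

Only the '(str, bytes)' form still accepts byte-encoded values, which is why the choice made in the 2.x source matters.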

Example based on https://bugs.python.org/issue38003:

if isinstance(value, basestring):
    if not isinstance(value, unicode):
        value = value.decode(encoding)
    process_text(value)
else:
    process_nontext(value)

2to3 produces

if isinstance(value, str):
    if not isinstance(value, str):
        value = value.decode(encoding)
    process_text(value)
else:
    process_nontext(value)

If, in 3.x, value is always unicode, then the inner conditional is dead and can be removed. But if, in 3.x, value might be byte-encoded text, it will not be decoded and the code is wrong. Fixes:

1. Instead of decoding value after the check, do it before the check. I think this is best for new code.

if isinstance(value, bytes):
    value = value.decode(encoding)
...
if isinstance(value, unicode):
    process_text(value)
else:
    process_nontext(value)

2. Replace 'basestring' with '(unicode, bytes)'. This is easier with existing code.

if isinstance(value, (unicode, bytes)):
    if not isinstance(value, unicode):
        value = value.decode(encoding)
    process_text(value)
else:
    process_nontext(value)

I believe, but have not tested, that 2to3 produces correct 3.x code from either 1 or 2, since the only change it then makes is replacing 'unicode' with 'str'.

3. Edit Lib/lib2to3/fixes/fix_basestring.py to replace 'basestring' with '(str, bytes)' instead of 'str'. This should be straightforward if one understands the tree format that lib2to3 fixers operate on.
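
To see the difference in 3.x, here is a minimal sketch comparing the code 2to3 produces from the original example with what the 'unicode' to 'str' replacement would give for fix 2. The function names, the process_text and process_nontext stand-ins, and the utf-8 default are hypothetical, chosen only for illustration.

```python
# Python 3 sketch. process_text/process_nontext are hypothetical stand-ins.
def process_text(value):
    return ("text", value)

def process_nontext(value):
    return ("nontext", value)

def converted_original(value, encoding="utf-8"):
    # 2to3 output for the original example:
    # 'basestring' and 'unicode' both became 'str'.
    if isinstance(value, str):
        if not isinstance(value, str):   # never true: dead branch
            value = value.decode(encoding)
        return process_text(value)
    else:
        return process_nontext(value)

def converted_fix2(value, encoding="utf-8"):
    # Fix 2 after conversion: '(unicode, bytes)' became '(str, bytes)'.
    if isinstance(value, (str, bytes)):
        if not isinstance(value, str):
            value = value.decode(encoding)
        return process_text(value)
    else:
        return process_nontext(value)

encoded = "spam".encode("utf-8")
print(converted_original(encoded))  # bytes fall through to the non-text branch
print(converted_fix2(encoded))      # bytes are decoded and handled as text
```

Unicode input behaves the same in both versions; only byte-encoded text is routed differently.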


Note that 2to3 is not meant for 2&3 code using exception tricks and six/future imports. Turning 2&3 code into idiomatic 3-only code is a separate subject.

--
Terry Jan Reedy
