Hi,
About a month ago Steve Langasek and I discussed the state of Python
packages on IRC, in particular the effects of bytecode compilation; the
effectiveness (or lack thereof) of it, and how it tightens Python
dependencies. I'd like to propose three changes to how Python modules
are handled.
All three can be summarized as: Python should not compile stuff by
default; it is premature optimization, it wastes time and disk space,
and it doesn't solve the real problems anyway.
1. Stop compiling .pyo files, entirely (I'm hoping for little argument
on this).
Rationale: .pyo files are a joke. They aren't optimized in any
meaningful sense; they just have asserts removed. Examples for several
non-trivial files:
$ md5sum stock.pyc stock.pyo widgets.pyc widgets.pyo formats/_audio.pyc
formats/_audio.pyo
5ca1a79bf036e9eddf97028c00f1d0c7 stock.pyc
5ca1a79bf036e9eddf97028c00f1d0c7 stock.pyo
f6c17acdf8043bb8524834f9a5f5c747 widgets.pyc
f6c17acdf8043bb8524834f9a5f5c747 widgets.pyo
dea672e99bb57f7e7585378886eb3cb0 formats/_audio.pyc
dea672e99bb57f7e7585378886eb3cb0 formats/_audio.pyo
They also aren't even loaded unless you run python with -O, which I
don't think any Python programs in Debian do.
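The assert-stripping behavior is easy to verify without inspecting
bytecode at all; here's a small sketch (using a current interpreter via
subprocess, purely for illustration):

```python
import subprocess
import sys

# A snippet whose assert always fails; under -O the assert is
# compiled out entirely, so the print is reached.
code = 'assert False, "removed by -O"\nprint("ok")'

normal = subprocess.run([sys.executable, "-c", code],
                        capture_output=True, text=True)
optimized = subprocess.run([sys.executable, "-O", "-c", code],
                           capture_output=True, text=True)

print(normal.returncode)         # non-zero: AssertionError raised
print(optimized.stdout.strip())  # "ok": the assert never existed
```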
How?: compileall.py:57,
-cfile = fullname + (__debug__ and 'c' or 'o')
+cfile = fullname + 'c'
2. Stop compiling .pyc files (this I expect to be contentious), unless a
package wants to.
Rationale: .pyc files have a minimal gain, and numerous failings.
Advantages of .pyc files:
* .pyc files make Python imports go marginally faster. However,
for nontrivial Python programs, the import time is dwarfed
by other startup code. Some quick benchmarks show about 20% gains
for importing a .pyc over a .py. But even then, the wall-clock time
is on the order of 0.5 seconds. Lars Wirzenius mentioned that
this time matters for enemies-of-carlotta, and it probably also
matters for some CGI scripts.
* Generating them at compile-time means they won't accidentally
get generated some other time.
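The "marginally faster" claim above can be checked with a rough
benchmark along these lines (the module name and size are made up for
illustration; cold means no cached bytecode, warm means the cache
exists):

```python
import os
import shutil
import subprocess
import sys
import tempfile
import time

def time_import(workdir, modname):
    """Time 'import modname' in a fresh interpreter, in seconds."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "import %s" % modname],
                   cwd=workdir, check=True)
    return time.perf_counter() - start

workdir = tempfile.mkdtemp()
try:
    # A deliberately bulky module so compilation is measurable.
    with open(os.path.join(workdir, "bulky.py"), "w") as f:
        for i in range(3000):
            f.write("CONST_%d = %d * %d\n" % (i, i, i))

    cold = time_import(workdir, "bulky")  # compiles on first import
    warm = time_import(workdir, "bulky")  # reuses cached bytecode
    print("cold: %.3fs  warm: %.3fs" % (cold, warm))
finally:
    shutil.rmtree(workdir)
```

Even when the warm import wins, both numbers are dominated by
interpreter startup, which is the point being made above.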
Disadvantages:
* They waste disk space; a .pyc is roughly as large as the source
itself.
* It's still far too easy for modules to be regenerated for the
wrong version of Python; just run the program as root.
* .pyc files are not really architecture-independent. The integer
constant 4294967296 will be a long in .pyc files compiled on 32-bit
architectures, and an int when compiled on 64-bit architectures.
The resulting module will run on both architectures, but won't
behave in the same way as a module from that machine. To be fair,
I don't know of any real-world examples that will break because
of this.
* .pyc files result in strange bugs if they are not cleaned up
properly, since Python will import them regardless of whether
or not an equivalent .py is present.
* If we don't care about byte-compilation, the multi-version
support suggested in 2.2.3 section 2 becomes much easier --
just add that directory to sys.path (or use the existing
unversioned /usr/lib/site-python). .pyc files are the rationale
behind the tight dependencies on Python versions, which is the last
of my suggested changes.
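The stale-bytecode failure mode from the list above is easy to
reproduce: a .pyc sitting in the source location is imported even after
the .py is gone (sketch on a modern interpreter, where sourceless
imports from the legacy location still work):

```python
import os
import py_compile
import shutil
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
try:
    src = os.path.join(workdir, "ghost.py")
    with open(src, "w") as f:
        f.write("VALUE = 'old code'\n")

    # Compile next to the source (the legacy .pyc location),
    # then delete the .py entirely.
    py_compile.compile(src, cfile=os.path.join(workdir, "ghost.pyc"))
    os.remove(src)

    # The import still succeeds, from bytecode alone.
    result = subprocess.run(
        [sys.executable, "-c", "import ghost; print(ghost.VALUE)"],
        cwd=workdir, capture_output=True, text=True)
    print(result.stdout.strip())  # "old code"
finally:
    shutil.rmtree(workdir)
```

A package that removes a .py but leaves the .pyc behind keeps shipping
the old code silently, which is exactly the "strange bugs" case above.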
Another note: Currently, Python policy is based around the assumption
that .pyc files are valid within a single minor Python revision. I don't
find any evidence to support this in the Python documentation. In fact,
the marshal module documentation specifically says there are no such
guarantees. However, I don't think this has ever been a problem in
practice (if it was, we wouldn't notice, because Python just ignores
invalid pyc files).
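That silent-ignore behavior hinges on a per-version magic number in the
.pyc header; corrupt it and Python quietly falls back to the source
(sketch, using the modern __pycache__ layout for convenience):

```python
import importlib.util
import os
import py_compile
import shutil
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
try:
    src = os.path.join(workdir, "quiet.py")
    with open(src, "w") as f:
        f.write("VALUE = 42\n")

    pyc = py_compile.compile(src)  # returns the cached .pyc path
    # Stamp a bogus magic number over the header.
    with open(pyc, "r+b") as f:
        f.write(b"\x00\x00\x00\x00")

    # Import succeeds anyway: the bad cache is discarded without any
    # warning and the module is recompiled from source.
    result = subprocess.run(
        [sys.executable, "-c", "import quiet; print(quiet.VALUE)"],
        cwd=workdir, capture_output=True, text=True)
    print(result.stdout.strip())            # 42
    print(importlib.util.MAGIC_NUMBER[:2])  # what a valid header starts with
finally:
    shutil.rmtree(workdir)
```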
How?: dh_python should not call compileall.py unless given some special
flag. Python policy 2.5 should change "should be generated" to "may be
generated." On the other hand, the removal code should be a "must" to
avoid littering the filesystem if .pyc files do get accidentally
generated.
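For reference, the compilation step in question boils down to a
compileall call over the package's module directory; roughly the
following (the directory is a stand-in for a site-packages path; note
that modern Pythons write the cache to __pycache__, whereas the Pythons
discussed here put the .pyc next to the .py):

```python
import compileall
import os
import shutil
import tempfile

# Stand-in for a package's installed module directory.
pkgdir = tempfile.mkdtemp()
with open(os.path.join(pkgdir, "mod.py"), "w") as f:
    f.write("X = 1\n")

# This is essentially the step that would become opt-in under the
# proposal: byte-compile everything below the directory.
ok = compileall.compile_dir(pkgdir, quiet=1)
pyc_written = os.path.isdir(os.path.join(pkgdir, "__pycache__"))
shutil.rmtree(pkgdir)
print(bool(ok), pyc_written)
```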
I'm willing to write the patch for dh_python if there's agreement on
this.
The Python standard library should still compile .pyc files, because
this is a prerequisite for any program to make good use of .pyc files.
The problems don't apply here, because it's easy to keep the interpreter
and standard library in sync.
3. Python dependencies should be loosened (and here I expect a
flamewar).
Rationale: Python migrations in Debian suck, period. One reason for this
is that every Python program and module has a strict dependency on
python >> 2.x, << 2.x+1, so during a Python migration absolutely
everything must be rebuilt. But most pure-Python programs and modules
are upward-compatible, especially these days when Debian is a minor
version behind.
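Concretely, the change is the difference between something like the
following two Depends lines in debian/control (version numbers
illustrative):

```
today, rebuilt for every Python transition:
  Depends: python (>= 2.3), python (<< 2.4)

proposed, survives a transition as-is:
  Depends: python (>= 2.3)
```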
Tools like dh_python do make this easier, by making backporting (or
sideporting to e.g. Ubuntu) simply a rebuild. But why bother with even
that, when it's not necessary?
Without .pyc files, there's no reason for this tight dependency at all.
Even if we keep .pyc files, I think loosening this requirement is a good
idea.