I've been investigating why some Python packages are unreproducible[1] and have found that in some cases the problem can be traced to CPython's hash randomization. This happens whenever a package writes files whose contents depend on the iteration order of dictionaries or sets. An example is python-phply[2], which depends on PLY[3], an LALR parser generator for Python. Given a grammar, PLY generates LALR parse tables and writes them to a file so that they don't need to be regenerated, and in generating that file PLY iterates over dict.items()[4]. The problem has also occurred in other contexts; for instance, Sphinx had a reproducibility issue[5] related to hash randomization. Another example is pickle: running the following script under CPython with different values of PYTHONHASHSEED will generate different pickle files, because the order in which a dictionary is created affects its pickle.

    import pickle

    d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
    with open('temp.pickle', 'wb') as f:
        pickle.dump(dict(d.items()), f)

There's often no simple solution for these problems at the level of the packages themselves. In PLY's case, sorting the parse tables before writing them to file doesn't work because of how PLY iterates over dictionaries during table generation[6]. I doubt that the other solution proposed in that GitHub issue, using an ordered dictionary, will be accepted by David Beazley, because it would cause a significant performance hit on CPython <3.5, particularly CPython 2.7, since a C implementation of ordered dictionaries was only added in 3.5. More broadly, trying to patch every affected Python package is impractical, both because of the number of affected packages and because any individual patch can be quite complicated, if it's possible at all.

I think a better solution is to disable hash randomization by setting PYTHONHASHSEED=0 when building Python packages with CPython for Debian, probably somewhere in dh-python. Note that this isn't necessary for PyPy, which doesn't have hash randomization[7].

Hash randomization was implemented to prevent "[H]ash collisions [being] exploited to DoS a web framework that automatically parses input forms into dictionaries"[8]. This shouldn't be an issue at build time, and whenever CPython is later run to read the files written during the build, hash randomization will be enabled again.

Ceridwen

[1] https://wiki.debian.org/ReproducibleBuilds
[2] https://packages.debian.org/stretch/python/python-phply
[3] https://github.com/dabeaz/ply
[4] https://github.com/dabeaz/ply/blob/master/ply/yacc.py#L2733
[5] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=795976;msg=29
[6] https://github.com/dabeaz/ply/issues/79
[7] http://doc.pypy.org/en/latest/cpython_differences.html#miscellaneous
[8] https://bugs.python.org/issue13703
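
P.S. A minimal sketch for checking the effect of PYTHONHASHSEED locally. It is not taken from any of the packages mentioned above; the names CHILD and run_with_seed are made up for illustration. It starts a fresh interpreter with a chosen seed and prints the iteration order of a small set of strings. With a fixed seed the order should be the same from run to run, while different seeds may (but need not) give different orders.

```python
import os
import subprocess
import sys

# Hypothetical child program: print the iteration order of a set of strings.
CHILD = "print(list({'a', 'b', 'c', 'd', 'e'}))"


def run_with_seed(seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to `seed`."""
    # Copy the current environment and override only the hash seed.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Seed 0 disables randomization; repeated runs should print the same order.
    for seed in (0, 0, 1, 2):
        print(seed, run_with_seed(seed))
```

A build helper could apply the same idea by exporting PYTHONHASHSEED=0 in the environment of the build commands, which is the approach suggested above for dh-python.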