Friends,

For context (so to speak), I did some work in the 0.9 timeframe to fix SPARK-897 (provide immediate feedback when closures aren't serializable) and SPARK-729 (make sure that free variables in closures are captured when the RDD transformations are declared).
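To make the approach concrete: the SPARK-729 branch captures free variables by replacing each closure with a copy obtained by serializing and immediately deserializing it at the point where the transformation is declared, and serializing eagerly is also what provides the immediate feedback SPARK-897 asks for. A simplified sketch of the idea (this is not the code in my branch, and the names are made up):

  import java.io._

  object ClosureCapture {
    // Clone a closure by serializing and immediately deserializing it.
    // The copy's free variables are fixed at the values they held when the
    // transformation was declared (SPARK-729), and a serialization failure
    // surfaces right here instead of at job submission (SPARK-897).
    def captureNow[T <: AnyRef](closure: T): T = {
      val bos = new ByteArrayOutputStream()
      try {
        val oos = new ObjectOutputStream(bos)
        oos.writeObject(closure)
        oos.close()
      } catch {
        case e: NotSerializableException =>
          // Immediate feedback at declaration time
          throw new IllegalArgumentException("closure is not serializable", e)
      }
      val ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
      ois.readObject().asInstanceOf[T]
    }
  }

A transformation like rdd.map(f) would then hold on to something like captureNow(f) rather than f itself.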
I currently have a branch addressing SPARK-897 that builds and tests out against 0.9, 1.0, and master, last I checked (https://github.com/apache/spark/pull/143). My branch addressing SPARK-729 builds on the SPARK-897 branch and passed the test suite in 0.9 [1]; however, some things that changed or were added in 1.0 wound up depending on the old behavior. I've been working on other things lately but would like to get these issues fixed after 1.0 goes final, so I was hoping for a bit of discussion on the best way forward with an issue I haven't solved yet:

ContextCleaner uses weak references to track broadcast variables. Since weak references don't track cloned objects (including copies produced by serializing and deserializing), capturing free variables in closures in the obvious way (i.e. by replacing each closure with a copy that has been serialized and deserialized) results in an undesirable situation: we might have, e.g., live HTTP broadcast variable objects referring to filesystem resources that could be cleaned at any time, because the objects they were cloned from have become only weakly reachable. To be clear, this isn't a problem now; it only becomes a problem under the fix I'm proposing for SPARK-729.

With that said, I'm wondering whether it would make more sense to fix this by adding a layer of indirection that reference counts the external, persistent resources rather than the objects that putatively own them, or to take a more sophisticated (but potentially more fragile) approach to ensuring variable capture. A very rough sketch of the indirection idea is at the bottom of this mail, below the footnote.

thanks,
wb

[1] Serializing closures also created or uncovered a PySpark issue in 0.9 (and presumably in later versions as well) that requires further investigation, but my patch did include a workaround; the details are here: https://issues.apache.org/jira/browse/SPARK-1454
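To sketch what I mean by indirection (all of the names below are made up, and this glosses over the driver/executor split): rather than tying cleanup to the weak reachability of one particular Broadcast object, we could track the persistent resource by its broadcast id with an explicit reference count, so that a clone deserialized out of a captured closure can register itself and keep the underlying files alive until the last holder goes away. Something like:

  import java.util.concurrent.ConcurrentHashMap
  import java.util.concurrent.atomic.AtomicInteger

  class BroadcastResourceTracker {
    private val refCounts = new ConcurrentHashMap[Long, AtomicInteger]()

    // Called whenever a broadcast object comes into existence on the driver,
    // whether it's the original or a clone deserialized from a captured closure.
    def register(broadcastId: Long): Unit = {
      val fresh = new AtomicInteger(0)
      val existing = refCounts.putIfAbsent(broadcastId, fresh)
      (if (existing != null) existing else fresh).incrementAndGet()
    }

    // Called when a tracked broadcast object goes away (e.g. when the cleaner
    // notices it has become only weakly reachable). The filesystem/HTTP
    // resources behind the id are cleaned only once the count reaches zero.
    def release(broadcastId: Long)(cleanup: Long => Unit): Unit = {
      val counter = refCounts.get(broadcastId)
      if (counter != null && counter.decrementAndGet() == 0) {
        refCounts.remove(broadcastId)
        cleanup(broadcastId)
      }
    }
  }

The obvious cost is that a missed release leaks the resource, which is exactly what the weak-reference scheme avoids, so I'm not claiming this is the right trade-off; it's just the shape of the thing I'd like opinions on.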