GitHub user willb opened a pull request: https://github.com/apache/spark/pull/143
SPARK-897: preemptively serialize closures

These commits cause `ClosureCleaner.clean` to attempt to serialize the cleaned closure with the default closure serializer and throw a `SparkException` if doing so fails. This behavior is enabled by default but can be disabled at individual call sites of `SparkContext.clean`.

Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work exposed; I'm happy to put that in a separate PR if that would be more appropriate.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-897

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #143

----
commit bcab2f0414a956ffa89c5dd0fee16de1b33320a2
Author: William Benton <wi...@redhat.com>
Date:   2014-03-13T02:56:32Z

    Test case for SPARK-897.

    Tests to make sure that passing an unserializable closure
    to a transformation fails fast.

commit f2ef54e4ec92d8f0ee3e91af4f507bcabd29a7c0
Author: William Benton <wi...@redhat.com>
Date:   2014-03-13T19:21:45Z

    Generalized proactive closure serialization test.

commit 6cb921874c02f3f03dd66db697c6995dc9565a0f
Author: William Benton <wi...@redhat.com>
Date:   2014-03-13T19:40:42Z

    Adds proactive closure-serializability checking

    ClosureCleaner.clean now checks to ensure that its closure argument
    is serializable by default and throws a SparkException with the
    underlying NotSerializableException in the detail message otherwise.
    As a result, transformation invocations with unserializable closures
    will fail at their call sites rather than when they actually execute.

    ClosureCleaner.clean now takes a second boolean argument; pass false
    to disable serializability-checking behavior at call sites where this
    behavior isn't desired.
commit 98e01ae854dd3fce03d753d5f25a6022ae6f58d6
Author: William Benton <wi...@redhat.com>
Date:   2014-03-14T16:40:56Z

    Ensure assertions in Graph.apply are asserted.

    The Graph.apply test in GraphSuite had some assertions in a closure in
    a graph transformation. This caused two problems:

    1. because assert() was called, test classes were reachable from the
       closures, which made them not serializable, and

    2. (more importantly) these assertions never actually executed, since
       they occurred within a lazy map().

    This commit simply changes the Graph.apply test to collect the graph
    triplets so it can assert about each triplet from a map method.

commit 70a449d87018e7bfa8dbf7249948a7f48a891719
Author: William Benton <wi...@redhat.com>
Date:   2014-03-14T17:33:33Z

    Make proactive serializability checking optional.

    SparkContext.clean uses ClosureCleaner's proactive serializability
    checking by default. This commit adds an overloaded clean method
    to SparkContext that allows clients to specify that serializability
    checking should not occur as part of closure cleaning.

commit 9eb301387644d5c14a03a0bbb96c6b007f228f3d
Author: William Benton <wi...@redhat.com>
Date:   2014-03-14T17:34:42Z

    Don't check serializability of DStream transforms.

    Since the DStream is reachable from within these closures, they aren't
    checkable by the straightforward technique of passing them to the
    closure serializer.

----
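
For readers who want a feel for the fail-fast behavior described in the commits above, here is a rough sketch in plain Java. It is not the code from this pull request and does not use Spark at all: `EagerSerializationCheck`, `ClosureNotSerializableException`, and `SerializableFunction` are hypothetical names made up for illustration, and the check simply runs standard Java serialization on the closure, on the assumption that this approximates what a Java-serialization-based closure serializer would do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class EagerSerializationCheck {

    /** Hypothetical stand-in for the SparkException mentioned in the PR. */
    static class ClosureNotSerializableException extends RuntimeException {
        ClosureNotSerializableException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Function type that is also Serializable, so lambdas assigned to it can be written out. */
    interface SerializableFunction<A, B> extends Function<A, B>, Serializable {}

    /**
     * Returns the closure unchanged. When checkSerializable is true, first
     * tries to serialize it and fails immediately if that is not possible.
     */
    static <T> T clean(T closure, boolean checkSerializable) {
        if (checkSerializable) {
            try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
                out.writeObject(closure);
            } catch (NotSerializableException e) {
                // Keep the underlying exception in the detail message and as the cause.
                throw new ClosureNotSerializableException("Task not serializable: " + e, e);
            } catch (IOException e) {
                throw new ClosureNotSerializableException("Serialization check failed: " + e, e);
            }
        }
        return closure;
    }

    /** Checking is on by default, mirroring the default described in the PR. */
    static <T> T clean(T closure) {
        return clean(closure, true);
    }

    public static void main(String[] args) {
        // Captures only an int, so the check passes.
        int offset = 1;
        SerializableFunction<Integer, Integer> ok = x -> x + offset;
        clean(ok);

        // Captures a plain Object, which is not Serializable, so the check
        // fails here at the call site instead of later when the closure runs.
        Object unserializable = new Object();
        SerializableFunction<Integer, Integer> bad = x -> x + unserializable.hashCode();
        try {
            clean(bad);
        } catch (ClosureNotSerializableException e) {
            System.out.println("failed fast: " + e.getMessage());
        }

        // Passing false skips the check for call sites that opt out.
        clean(bad, false);
    }
}
```

The two-argument overload plays the role of the opt-out described in the last two commits: a caller that knows its closure cannot be checked this way passes `false` and gets the old behavior.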