GitHub user andrewor14 opened a pull request: https://github.com/apache/spark/pull/42
[WIP] [SPARK-1132] Persisting Web UI through refactoring the SparkListener interface

The fleeting nature of the Spark Web UI has long been a problem reported by many users: the existing Web UI disappears as soon as the associated application terminates. This is because SparkUI is tightly coupled with SparkContext and cannot be instantiated independently of it. To solve this, some state must be saved to persistent storage while the application is still running.

The approach taken by this PR is to persist the UI state through SparkListenerEvents. This requires a major refactor of the SparkListener interface, because existing events (1) maintain deep references, which makes de/serialization difficult, and (2) do not encode all the information displayed on the UI.

The new architecture is as follows. The SparkUI registers a central gateway listener with the SparkContext. For each event this gateway listener receives, it logs the event to a file and relays it to all of its child listeners (e.g. ExecutorsListener) used by the UI. Each child listener then constructs the appropriate information from these events and supplies it to its parent UI (e.g. ExecutorsUI), which renders the associated page(s) on demand. Then, after the SparkContext has stopped, the SparkUI can be revived by replaying all the logged events to the child listeners through the gateway listener, as sketched below.

This patch is WIP and additional features are still expected. The main TODOs at this point include adding support for logging to HDFS, handling long-running jobs (perhaps through checkpointing), and performance testing.

To try this patch out, run your Spark application to completion as usual, then run ```bin/spark-class org.apache.spark.ui.UIReloader /tmp/spark-<user name>```. Your revived Spark Web UI awaits on port 14040. More details can be found in the commit messages, the comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf). Comments and feedback are most welcome.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/42.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #42

----

commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-04T02:18:04Z

Relax assumptions on compressors and serializers when batching

This commit introduces an intermediate layer of an input stream at the batch level. This guards against interference from higher-level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.

commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-04T02:18:04Z

Relax assumptions on compressors and serializers when batching

This commit introduces an intermediate layer of an input stream at the batch level. This guards against interference from higher-level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
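As a rough illustration of the gateway architecture described in the PR summary above, here is a minimal sketch in Scala. All names here (UIEvent, UIListener, GatewayListener, onEvent, replay) are illustrative stand-ins rather than the PR's actual classes; the real implementation hangs off the SparkListener interface and serializes events to JSON via FileLogger and JsonProtocol.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-ins for the real SparkListener types (names are illustrative only).
sealed trait UIEvent
case class TaskEndEvent(taskId: Long, executorId: String) extends UIEvent

trait UIListener {
  def onEvent(event: UIEvent): Unit
}

// The gateway: persists every event it receives and relays it to the UI's child
// listeners (e.g. an ExecutorsListener equivalent).
class GatewayListener(persist: UIEvent => Unit) extends UIListener {
  private val children = ArrayBuffer.empty[UIListener]

  def registerChild(child: UIListener): Unit = children += child

  override def onEvent(event: UIEvent): Unit = {
    persist(event)                       // e.g. append a JSON line to a log file
    children.foreach(_.onEvent(event))   // keep the live UI up to date
  }

  // After the application stops, the UI is revived by replaying the logged events
  // through the same children, so they rebuild exactly the state they had when live.
  def replay(logged: Iterator[UIEvent]): Unit =
    logged.foreach(e => children.foreach(_.onEvent(e)))
}
```

The key design choice is that the same child listeners consume both live and replayed events, so a revived UI is rebuilt by the exact code paths that rendered it while the application was running.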
commit 3df700509955f7074821e9aab1e74cb53c58b5a5
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-04T02:27:49Z

Merge branch 'master' of github.com:andrewor14/incubator-spark

commit 287ef44e593ad72f7434b759be3170d9ee2723d2
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-04T21:38:32Z

Avoid reading the entire batch into memory; also simplify streaming logic

Additionally, address formatting comments.

commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-04T21:44:24Z

Typo: phyiscal -> physical

commit 13920c918efe22e66a1760b14beceb17a61fd8cc
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-05T00:34:15Z

Update docs

commit 090544a87a0767effd0c835a53952f72fc8d24f0
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-05T18:58:23Z

Privatize methods

commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-05T20:09:32Z

Also privatize fields

commit e3ae35f4fb1ce8e2d7398afdbabab3dbf4bb2ffe
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-11T00:15:15Z

Merge github.com:apache/incubator-spark

Conflicts:
    core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala

commit 8e09306f6dd4ab421447d769572de58035d3d66a
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-12T01:48:16Z

Use JSON for ExecutorsUI

commit 10ed49dffe4a515bff42762cb025a3f64d9cd407
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-12T18:53:32Z

Merge github.com:apache/incubator-spark into persist-ui

commit dcbd312b1e4585445868dfb562f9c64ac2fc8cda
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-12T23:58:39Z

Add JSON serializability for all SparkListenerEvents

This also involves a clean-up in the way these events are structured. The existing way these events are defined maintains a lot of extraneous information. To avoid serializing the whole tree of RDD dependencies, for instance, this commit cherry-picks only the relevant fields. However, this means sacrificing JobLogger's functionality of tracing the entire RDD tree. Additionally, this commit also involves minor formatting and naming clean-ups within the scope of the above changes.

commit bb222b9f7422cdf9e3a4c682bb271da1f75f4f75
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-13T04:35:09Z

ExecutorsUI: render completely from JSON

Additionally, this commit fixes a bug in local mode, where the executor IDs of tasks do not match those of storage statuses (more detail in ExecutorsUI.scala). This commit does not serialize the SparkListenerEvents yet, but instead serializes the changes to each executor's JSON. This is a big TODO for the upcoming commit.

commit bf0b2e9e92d760d49ba7b26aaa41b9e3aef2420f
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-14T03:12:53Z

ExecutorsUI: serialize events rather than arbitrary executor information

This involves adding a new SparkListenerStorageFetchEvent and adding JSON serializability to all of the objects it depends on.

commit de8a1cdb833d80423aba629ba932b6f403ecd4ab
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-15T03:22:50Z

Serialize events both to and from JSON (rather than just to)

This requires every field of every event to be completely reconstructible from its JSON representation. This commit may contain incomplete state.
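The last two commits above ("Add JSON serializability for all SparkListenerEvents" and "Serialize events both to and from JSON") require every event field to survive a JSON round trip, while cherry-picking only the fields the UI actually needs rather than deep structures such as the full RDD dependency tree. Below is a minimal sketch of that idea using json4s-jackson (the library a later commit in this PR migrates to); RDDSummary and RDDSummaryJson are hypothetical stand-ins for the PR's real JsonProtocol.

```scala
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Hypothetical flattened payload: only the fields the UI needs, not the RDD lineage.
case class RDDSummary(id: Int, name: String, numPartitions: Int)

object RDDSummaryJson {
  implicit val formats: Formats = DefaultFormats

  // To JSON: cherry-pick only the relevant fields.
  def toJson(rdd: RDDSummary): JValue =
    ("id" -> rdd.id) ~ ("name" -> rdd.name) ~ ("numPartitions" -> rdd.numPartitions)

  // From JSON: every field must be reconstructible, or the revived UI loses state.
  def fromJson(json: JValue): RDDSummary =
    RDDSummary(
      (json \ "id").extract[Int],
      (json \ "name").extract[String],
      (json \ "numPartitions").extract[Int])
}

object RoundTripExample extends App {
  val original = RDDSummary(3, "lines", 8)
  val logged   = compact(render(RDDSummaryJson.toJson(original)))  // one line in the event log
  val restored = RDDSummaryJson.fromJson(parse(logged))
  assert(restored == original)
}
```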
commit 8a2ebe6ba37b2d5efe344aa3bea343cda1411212
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-15T06:01:21Z

Fix bugs for EnvironmentUI and ExecutorsUI

In particular, EnvironmentUI was not rendering until a job began, and ExecutorsUI reported an incorrect number (format) of total tasks.

commit c4cd48022b3a8dbf60f458196e21ba8c9cb3b88f
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-15T06:53:43Z

Also deserialize new events

This includes SparkListenerLoadEnvironment and SparkListenerStorageStatusFetch.

commit d859efc34c9a5f07bae7eca7b4ab72fa19fb7e29
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-15T22:01:14Z

BlockManagerUI: Add JSON functionality

commit 8add36bb08126fbcd02d23c446dd3ec970f1f549
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-15T22:40:49Z

JobProgressUI: Add JSON functionality

In addition, refactor FileLogger to log in one directory per logger.

commit b3976b0a2eb21b4a887d01fd16869a0f37c36f8b
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-16T06:52:43Z

Add functionality for reconstructing a persisted UI from SparkContext

With this commit, any reconstructed SparkUI resides on the default port of 14040 onwards. Logged events are posted separately from live events, such that the live SparkListeners are not affected. This commit also fixes a few JSON de/serialization bugs.

commit 4dfcd224504f392302a49ac82280b294c381f381
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-17T19:14:20Z

Merge git://git.apache.org/incubator-spark into persist-ui

commit f3fc13b53725cdfeddcecb2068ab5a533566772f
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-17T21:22:01Z

General refactor

This includes reverting previous formatting and naming changes that are irrelevant to this patch.

commit 3fd584e30aaf6552179bf9e9b350b130fa92d0ad
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-18T02:01:12Z

Fix two major bugs

First, JobProgressListener uses HashSets of TaskInfo and StageInfo, and relies on the equality of these objects to remove them from the corresponding HashSets correctly. This is not a luxury that deserialized StageInfos and TaskInfos have. Instead, when removing from these collections, we must match by ID rather than by the object itself. Second, although SparkUI differentiates between persisted and live UIs, its children UIs and their corresponding listeners do not. Thus, each revived UI essentially duplicated all the logs that reconstructed it in the first place. Further, these zombie UIs continued to respond to live SparkListenerEvents. This has been fixed by requiring that revived UIs do not register their listeners with the current SparkContext. With the former fix, there were major incompatibility issues with the existing way UI classes access and mutate these collections; formatting improvements associated with smoothing out those inconsistencies are included as part of this commit.

commit 5ac906d4dfd546c5d6b6e80540c8774f3985fecc
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-18T05:38:16Z

Mostly naming, formatting, and code style changes

commit 904c7294ac221a0cd9806af843219aaa8a847085
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-18T06:06:46Z

Fix another major bug

Previously, rendering the old, persisted UI continued to trigger load-environment and storage-status-fetch events. These are now only triggered for the live UI. A related TODO: under JobProgressUI, the total duration is inaccurate; right now it uses the time when the old UI is revived rather than when it was live. This should be fixed.
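The "Fix two major bugs" commit above points out that StageInfo and TaskInfo objects deserialized from the event log are distinct instances from the live ones, so listeners must remove entries by ID rather than by object equality. A small self-contained illustration follows; StageInfoStub is a hypothetical stand-in for the real StageInfo, which does not define structural equality.

```scala
import scala.collection.mutable

// Stand-in with reference equality, like the real StageInfo (deliberately not a case class).
class StageInfoStub(val stageId: Int, val name: String)

object RemoveByIdExample extends App {
  val activeStages = mutable.HashSet[StageInfoStub]()
  val live = new StageInfoStub(0, "count at Example.scala:10")
  activeStages += live

  // A StageInfo reconstructed from its JSON log entry is a different object instance,
  // so removing it by object equality silently leaves the live entry behind:
  val replayed = new StageInfoStub(0, "count at Example.scala:10")
  activeStages -= replayed
  assert(activeStages.nonEmpty)   // nothing was removed

  // The fix: match on the stage ID instead of the object itself.
  activeStages.find(_.stageId == replayed.stageId).foreach(activeStages -= _)
  assert(activeStages.isEmpty)
}
```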
commit 427301371117e9e7889f5df0f6bba51e5916e425
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-18T23:27:39Z

Add a gateway SparkListener to simplify event logging

Instead of having each SparkListener log an independent set of events, centralize event logging to avoid differentiating events across UIs and thus duplicating logged events. Also rename the "fromDisk" parameter to "live". TODO: the storage page currently still relies on the previous SparkContext and is not rendering correctly.

commit 64d2ce1efee3aa5a8166c5fe108932b2279217fc
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-19T02:29:21Z

Fix BlockManagerUI bug by introducing new event

Previously, the storage information of persisted RDDs continued to rely on the old SparkContext, which is no longer accessible if the UI is rendered from disk. This fix solves that by introducing an event, SparkListenerGetRDDInfo, which captures this information. Per discussion with Patrick, an alternative is to encapsulate this information within SparkListenerTaskEnd. This would bypass the need to create a new event, but would also require a non-trivial refactor of BlockManager / BlockStore.

commit 6814da0cf9af2a29810b6773463acee3b259c95f
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-19T18:36:01Z

Explicitly register each UI listener rather than through some magic

This (1) allows UISparkListener to be a simple trait and (2) is more intuitive, since it mirrors sc.addSparkListener(listener) for all other non-UI listeners.

commit d646df6786737d67d5ca1dbf593740a02a600991
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-20T02:47:35Z

Completely decouple SparkUI from SparkContext

This involves storing additional fields, such as the scheduling mode and the app name, in the new event, SparkListenerApplicationStart, since these attributes are no longer accessible without a SparkContext. Further, environment information is refactored to be loaded on application start (rather than on job start). Persisted Spark UIs can no longer be created from SparkContext. The new way of constructing them is through a standalone Scala program; org.apache.spark.ui.UIReloader is introduced as an example of how to do this.

commit e9e1c6dede36788d3cefe3c65366f5a79be97a1d
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-21T07:51:08Z

Move all JSON de/serialization logic to JsonProtocol

This makes all classes involved appear less cluttered.

commit 70e7e7acf09d8efd2c7e459ee450c1db140b8f5a
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-22T02:56:26Z

Formatting changes

commit 6631c02a8791d0321f003bb339344445f4dd0cab
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-24T18:52:21Z

More formatting changes, this time mainly for the Json DSL

commit bbe3501c63029ffa9c1fd9053e7ab868d0f28b10
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-26T23:27:43Z

Embed storage status and RDD info in Task events

This commit achieves three main things. First and foremost, it embeds the information from the SparkListenerFetchStorageStatus and SparkListenerGetRDDInfo events into events that are more descriptive of the SparkListenerInterface. In particular, every task now maintains a list of blocks whose storage statuses have been updated as a result of the task. Previously, this information was retrieved by fetching storage status from the driver, an action arbitrarily associated with a stage. This change involves keeping track of which blocks are dropped during each call to an RDD persist.
A big TODO is to also capture the behavior of an RDD unpersist in a SparkListenerEvent. Second, the SparkListenerEvent interface now handles the dynamic nature of executors. In particular, a new event, SparkListenerExecutorStateChange, is introduced, which triggers a storage status fetch from the driver. The purpose of this is mainly to decouple fetching storage status from the driver from the stage. Note that storage status is not ready until the remote BlockManagers have been registered, so this involves attaching a registration listener to the BlockManagerMasterActor. Third, changes in environment properties are now supported. This accounts for the fact that the user can invoke sc.addFile and sc.addJar in his/her own application, which should be reflected appropriately on the EnvironmentUI. In the previous implementation, coupling this information with application start prevented this from happening. Other relatively minor changes include: (1) refactoring BlockStatus and BlockManagerInfo to not be part of the BlockManagerMasterActor object, (2) formatting changes, especially those involving multi-line arguments, and (3) making all UI widgets and listeners private[ui] instead of private[spark].

commit 28019caa5712b8d7f1db039dc41876d91e530998
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-27T00:47:00Z

Merge github.com:apache/spark

Conflicts:
    core/src/main/scala/org/apache/spark/CacheManager.scala
    core/src/main/scala/org/apache/spark/SparkEnv.scala
    core/src/main/scala/org/apache/spark/scheduler/JobLogger.scala
    core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
    core/src/main/scala/org/apache/spark/storage/BlockManager.scala
    core/src/main/scala/org/apache/spark/storage/MemoryStore.scala
    core/src/main/scala/org/apache/spark/storage/StorageUtils.scala
    core/src/main/scala/org/apache/spark/ui/SparkUI.scala
    core/src/main/scala/org/apache/spark/ui/env/EnvironmentUI.scala
    core/src/main/scala/org/apache/spark/ui/exec/ExecutorsUI.scala
    core/src/main/scala/org/apache/spark/ui/jobs/IndexPage.scala
    core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
    core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
    core/src/main/scala/org/apache/spark/ui/jobs/PoolPage.scala
    core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
    core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala
    core/src/main/scala/org/apache/spark/ui/storage/IndexPage.scala
    core/src/main/scala/org/apache/spark/ui/storage/RDDPage.scala
    core/src/main/scala/org/apache/spark/util/TimeStampedHashMap.scala
    core/src/main/scala/org/apache/spark/util/Utils.scala
    core/src/test/scala/org/apache/spark/ui/jobs/JobProgressListenerSuite.scala

commit d1f428591d6c33c2bb86f85468c7842b5ca00311
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-27T01:19:20Z

Migrate from lift-json to json4s-jackson

commit 7b2f8112795a53c35b10bc3d72e5be7b699ceb65
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-27T19:24:32Z

Guard against TaskMetrics NPE + Fix tests

commit 996d7a2f42d4e02c1e40ec22b0c4d7db86aa03e3
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-27T20:26:23Z

Reflect RDD unpersist on UI

This introduces a new event, SparkListenerUnpersistRDD.
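The "Embed storage status and RDD info in Task events" commit described above moves block-level storage updates into the task-end events themselves, so the storage page can be reconstructed from the event log alone rather than by querying the driver. A rough sketch under that assumption follows, with hypothetical stand-in types (BlockUpdate, TaskEndStub, and StorageListenerStub are not the PR's real classes).

```scala
import scala.collection.mutable

// Hypothetical stand-ins: each task-end event carries the block-level storage updates
// produced by that task, including blocks that were dropped to make room for new ones.
case class BlockUpdate(blockId: String, memSize: Long, diskSize: Long)
case class TaskEndStub(taskId: Long, updatedBlocks: Seq[BlockUpdate])

// A UI-side listener folds these updates into running totals as events arrive,
// whether live or replayed from the log.
class StorageListenerStub {
  private val memUsedByBlock = mutable.Map.empty[String, Long]

  def onTaskEnd(event: TaskEndStub): Unit =
    event.updatedBlocks.foreach { b =>
      if (b.memSize == 0 && b.diskSize == 0) memUsedByBlock -= b.blockId  // block was dropped
      else memUsedByBlock(b.blockId) = b.memSize
    }

  def totalMemoryUsed: Long = memUsedByBlock.values.sum
}

object StorageExample extends App {
  val listener = new StorageListenerStub
  listener.onTaskEnd(TaskEndStub(1, Seq(BlockUpdate("rdd_0_0", 1024, 0))))
  listener.onTaskEnd(TaskEndStub(2, Seq(BlockUpdate("rdd_0_0", 0, 0))))  // later dropped
  assert(listener.totalMemoryUsed == 0)
}
```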
commit 472fd8a4845e39a38f8d993a3527a7e77571ffad
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-27T23:03:59Z

Fix a couple of tests

commit d47585f22f243fc7e840af90132edb7e84b003ed
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-28T00:15:21Z

Clean up FileLogger

commit faa113e674a276ddf5cd7dc643c16b7bed2b5e44
Author: Andrew Or <andrewo...@gmail.com>
Date: 2014-02-28T01:12:19Z

General clean up

----