Github user tdas commented on the pull request: https://github.com/apache/spark/pull/126#issuecomment-37498313

@yaoshengzhe This is only a safe, best-effort attempt to clean up metadata, so no guarantee is being provided here. All we are trying to do is ensure that for long-running Spark computations (say, a Spark Streaming program that runs 24/7) there is _something_ that cleans up in a safe way. I am taking care to make sure that the call to finalize() is cheap: just an insert into a queue, which does not block (inserts into a LinkedBlockingQueue without a capacity constraint do not block for all practical purposes).

Regarding phantom references: from what I understand, they do not provide any stronger guarantee on when garbage collection is performed than the current method does. A phantom reference is only enqueued after finalize has run on its referent, so the main source of uncertainty still comes directly from the garbage collection step itself, which no method can avoid. Moreover, using any sort of weak or phantom reference queue requires _every_ RDD to be wrapped in a WeakReference or PhantomReference. That seems to me to be unnecessary complexity with little added benefit.
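For concreteness, here is a minimal sketch of the finalize-plus-queue approach described above. All names (`TrackedRDD`, `CleanupTask`, `MetadataCleaner`) are illustrative stand-ins, not the actual code in this PR:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical cleanup task; the real PR uses its own types.
case class CleanupTask(rddId: Int)

object MetadataCleaner {
  // No capacity given, so the queue is unbounded: offer() succeeds
  // immediately and never blocks the JVM's finalizer thread.
  private val cleanupQueue = new LinkedBlockingQueue[CleanupTask]()

  def enqueue(task: CleanupTask): Unit = cleanupQueue.offer(task)

  // A daemon thread drains the queue and performs the actual cleanup,
  // keeping all expensive work off the finalizer thread.
  private val cleanerThread = new Thread("metadata-cleaner") {
    setDaemon(true)
    override def run(): Unit = {
      while (true) {
        val task = cleanupQueue.take() // blocks until a task is available
        // ... unpersist / remove metadata for task.rddId here ...
      }
    }
  }
  cleanerThread.start()
}

// Stand-in for an RDD whose metadata must eventually be cleaned.
class TrackedRDD(val id: Int) {
  // finalize() stays cheap: a single non-blocking queue insert.
  override def finalize(): Unit = MetadataCleaner.enqueue(CleanupTask(id))
}
```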
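And for contrast, a rough sketch of the phantom-reference alternative being argued against, again with hypothetical names. Note that every RDD instance has to be explicitly wrapped, each wrapper has to be kept strongly reachable, and a phantom reference is enqueued only after its referent has been finalized:

```scala
import java.lang.ref.{PhantomReference, Reference, ReferenceQueue}
import scala.collection.mutable

// Same hypothetical RDD stand-in as above.
class TrackedRDD(val id: Int)

object PhantomCleaner {
  private val refQueue = new ReferenceQueue[TrackedRDD]()

  // Every RDD must be wrapped, and each wrapper kept strongly reachable
  // (here, as a map key), or the wrapper itself gets collected.
  // PhantomReference.get() always returns null, so the RDD id has to be
  // recorded separately at wrap time.
  private val refs = mutable.Map[Reference[_ <: TrackedRDD], Int]()

  def track(rdd: TrackedRDD): Unit = synchronized {
    refs(new PhantomReference(rdd, refQueue)) = rdd.id
  }

  // Daemon thread draining the reference queue. A phantom reference is
  // enqueued only after its referent has been finalized, so this fires
  // no earlier than finalize() itself would.
  private val cleanerThread = new Thread("phantom-cleaner") {
    setDaemon(true)
    override def run(): Unit = {
      while (true) {
        val ref = refQueue.remove() // blocks until the GC enqueues a reference
        PhantomCleaner.synchronized {
          refs.remove(ref).foreach { id =>
            // ... clean up metadata for RDD `id` here ...
          }
        }
      }
    }
  }
  cleanerThread.start()
}
```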