Ah, I see this is streaming. I don't have any practical experience with that side of Spark, but the foreachPartition idea is a good approach. I've used that pattern extensively, not for singletons per se, but to create non-serializable objects like API and DB clients on the executor side. I think it's the most straightforward way to deal with any non-serializable object you need.
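
For what it's worth, here's a minimal sketch of that pattern. SomeDbClient and its methods are made-up placeholders for whatever 3rd-party client you're using:

import org.apache.spark.rdd.RDD

// Hypothetical non-serializable 3rd-party client, for illustration only.
class SomeDbClient {
  def send(record: String): Unit = println(s"sent: $record")
  def close(): Unit = ()
}

def writePartitions(rdd: RDD[String]): Unit =
  rdd.foreachPartition { partition =>
    // Constructed executor-side, once per partition, so it never has to
    // be serialized and shipped from the driver.
    val client = new SomeDbClient()
    try {
      partition.foreach(record => client.send(record))
    } finally {
      client.close()
    }
  }

The cost is one client per partition; if construction is expensive, a lazy singleton holder (like the one sketched in the quoted thread below) amortizes that to once per executor JVM.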
I don't entirely follow what over-network data shuffling effects you are alluding to (maybe that's more specific to streaming?).

On Wed, Jul 8, 2015 at 9:41 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:

> My singletons do in fact stick around. They're one per worker, it looks
> like. So with 4 workers running on the box, we're creating one singleton
> per worker process/JVM, which seems OK.
>
> Still curious about foreachPartition vs. foreachRDD, though...
>
> On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher <rmarsc...@localytics.com> wrote:
>
>> Would it be possible to have a wrapper class that just represents a
>> reference to a singleton holding the 3rd-party object? It could proxy
>> calls over to the singleton, which would lazily instantiate a private
>> instance of the 3rd-party object. I think something like this might
>> work if the workers have the singleton object in their classpath.
>>
>> Here's a rough sketch of what I was thinking:
>>
>> object ThirdPartySingleton {
>>   private lazy val thirdPartyObj = ...
>>
>>   def someProxyFunction() = thirdPartyObj.someMethod()
>> }
>>
>> class ThirdPartyReference extends Serializable {
>>   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
>> }
>>
>> Also found this SO post:
>> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>>
>> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg <dgoldenberg...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am seeing a lot of posts on singletons vs. broadcast variables, such as:
>>> * http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
>>> * http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219
>>>
>>> What's the best approach to instantiate an object once and have it be
>>> reused by the worker(s)?
>>>
>>> E.g. I have an object that loads some static state, such as a
>>> dictionary/map; it is part of a 3rd-party API and is not serializable.
>>> I can't seem to get it to be a singleton on the worker side, as the JVM
>>> appears to be wiped on every request, so I get a new instance and the
>>> singleton doesn't stick.
>>>
>>> Is there an approach where I could have this object, or a wrapper of
>>> it, be a broadcast var? Can Kryo get me there? Would that basically
>>> mean writing a custom serializer? The 3rd-party object may have a
>>> bunch of member vars hanging off it, though, so serializing it
>>> properly may be non-trivial...
>>>
>>> Any pointers/hints greatly appreciated.
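
P.S. Combining the two ideas, a lazy singleton holder used from inside the closure gives you one instance per executor JVM rather than one per partition. A rough sketch, with ExpensiveResource as a made-up stand-in for the 3rd-party object:

import org.apache.spark.rdd.RDD

// Hypothetical non-serializable resource, standing in for the 3rd-party object.
class ExpensiveResource {
  def lookup(key: String): String = key.toUpperCase
}

// One instance per executor JVM: the lazy val is initialized on first use
// on each worker and then reused by every task scheduled into that JVM.
object ResourceHolder {
  lazy val resource = new ExpensiveResource()
}

def enrich(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { partition =>
    // The reference resolves to the executor-local singleton; nothing
    // non-serializable is captured in the closure itself.
    partition.map(key => ResourceHolder.resource.lookup(key))
  }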