Ah, I see this is streaming. I don't have any practical experience with that
side of Spark, but the foreachPartition idea is a good approach. I've used
that pattern extensively, not for singletons per se, but to create
non-serializable objects like API and DB clients on the executor side. I
think it's the most straightforward way of dealing with any non-serializable
object you need.
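
For reference, here's a minimal sketch of the pattern I mean. It has no
actual Spark dependency (ApiClient and ClientHolder are made-up names); it
just shows the once-per-JVM lazy initialization that a foreachPartition body
would rely on when it touches ClientHolder.client:

```scala
// Stand-in for a non-serializable 3rd-party client (e.g. an API or DB client).
class ApiClient {
  // Count constructions so we can see only one instance is ever created.
  val id = ApiClient.counter.incrementAndGet()
  def send(record: String): String = s"sent:$record"
}

object ApiClient {
  val counter = new java.util.concurrent.atomic.AtomicInteger(0)
}

// Holder object: the lazy val is initialized once per JVM, on first access.
// On a Spark executor, every task in that JVM would share this instance.
object ClientHolder {
  lazy val client = new ApiClient()
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Simulates what each foreachPartition body would do with its records.
    val results = (1 to 3).map(i => ClientHolder.client.send(s"rec$i"))
    println(results.mkString(","))            // sent:rec1,sent:rec2,sent:rec3
    println(ApiClient.counter.get())          // 1 -- a single client per JVM
  }
}
```

In real Spark code the body of the map would live inside
rdd.foreachPartition { iter => ... }, and only the records (not the client)
would ever be serialized.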

I don't entirely follow what over-network data shuffling effects you are
alluding to (maybe that's more specific to streaming?).

On Wed, Jul 8, 2015 at 9:41 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
wrote:

> My singletons do in fact stick around. They're one per worker, looks
> like.  So with 4 workers running on the box, we're creating one singleton
> per worker process/jvm, which seems OK.
>
> Still curious about foreachPartition vs. foreachRDD though...
>
> On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher <
> rmarsc...@localytics.com> wrote:
>
>> Would it be possible to have a wrapper class that just represents a
>> reference to a singleton holding the 3rd-party object? It could proxy
>> calls over to the singleton, which instantiates a private instance of
>> the 3rd-party object lazily. I think something like this might work if
>> the workers have the singleton object on their classpath.
>>
>> Here's a rough sketch of what I was thinking:
>>
>> object ThirdPartySingleton {
>>   private lazy val thirdPartyObj = ...
>>
>>   def someProxyFunction() = thirdPartyObj.someFunction() // hypothetical call on the 3rd-party object
>> }
>>
>> class ThirdPartyReference extends Serializable {
>>   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
>> }
>>
>> also found this SO post:
>> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>>
>>
>> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg <dgoldenberg...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am seeing a lot of posts on singletons vs. broadcast variables, such as:
>>>
>>> * http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
>>> * http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219
>>>
>>> What's the best approach to instantiate an object once and have it be
>>> reused by the worker(s)?
>>>
>>> E.g. I have an object that loads some static state, such as a
>>> dictionary/map, is part of a 3rd-party API, and is not serializable. I
>>> can't seem to get it to be a singleton on the worker side, as the JVM
>>> appears to be wiped on every request, so I get a new instance. The
>>> singleton doesn't stick.
>>>
>>> Is there an approach where I could have this object, or a wrapper of
>>> it, be a broadcast var? Can Kryo get me there? Would that basically
>>> mean writing a custom serializer? The 3rd-party object may have a
>>> bunch of member vars hanging off it, so serializing it properly may be
>>> non-trivial...
>>>
>>> Any pointers/hints greatly appreciated.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>
