Richard,
It seems that whether I use foreachRDD or foreachPartition, I'm able to
create per-worker/per-JVM singletons: with 4 workers, I get 4 singletons
created.  I can't use broadcast vars because the 3rd party objects are not
serializable.
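
For concreteness, here's roughly the shape of what I'm doing (stream is the
DStream; ThirdPartyClient and its process() method are just placeholders for
the actual 3rd party API):

object WorkerSingleton {
  // initialized lazily, at most once per executor JVM, on first use
  private lazy val client = new ThirdPartyClient()
  def get: ThirdPartyClient = client
}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // same instance for every partition processed in this JVM
    val client = WorkerSingleton.get
    records.foreach(record => client.process(record))
  }
}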

By shuffling I mean Spark pulling data together from multiple machines over
the network in order to execute an operation. Probably not an issue for
foreachRDD, but more of one for operations like 'union' or 'subtract' and
the like.
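
E.g., a toy illustration (sc is the SparkContext; the values are made up):

val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4))
val diff = a.subtract(b) // repartitions both RDDs by key: a network shuffle
diff.collect()           // Array(1, 2), in some order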

On Wed, Jul 8, 2015 at 3:55 PM, Richard Marscher <rmarsc...@localytics.com>
wrote:

> Ah, I see this is streaming. I don't have any practical experience with
> that side of Spark, but the foreachPartition idea is a good approach. I've
> used that pattern extensively, though not for singletons, just to create
> non-serializable objects like API and DB clients on the executor side. I
> think it's the most straightforward approach to dealing with any
> non-serializable object you need.
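>
> A rough sketch of that pattern (DbClient and insert() here are just
> placeholders for whatever 3rd party client you're creating):
>
> rdd.foreachPartition { rows =>
>   // created on the executor, so the client is never serialized
>   val db = new DbClient()
>   try {
>     rows.foreach(row => db.insert(row))
>   } finally {
>     db.close() // one client per partition; clean up when done
>   }
> }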
>
> I don't entirely follow the over-network data shuffling effects you're
> alluding to (maybe they're more specific to streaming?).
>
> On Wed, Jul 8, 2015 at 9:41 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> My singletons do in fact stick around; it looks like there's one per
>> worker.  So with 4 workers running on the box, we're creating one
>> singleton per worker process/JVM, which seems OK.
>>
>> Still curious about foreachPartition vs. foreachRDD though...
>>
>> On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher <
>> rmarsc...@localytics.com> wrote:
>>
>>> Would it be possible to have a wrapper class that just represents a
>>> reference to a singleton holding the 3rd party object? It could proxy
>>> calls over to the singleton object, which would lazily instantiate a
>>> private instance of the 3rd party object. I think something like this
>>> might work if the workers have the singleton object on their classpath.
>>>
>>> Here's a rough sketch of what I was thinking (someFunction() stands in
>>> for whatever the 3rd party API actually exposes):
>>>
>>> object ThirdPartySingleton {
>>>   // instantiated lazily, at most once per executor JVM/classloader
>>>   private lazy val thirdPartyObj = ... // construct the 3rd party object here
>>>
>>>   def someProxyFunction() = thirdPartyObj.someFunction()
>>> }
>>>
>>> class ThirdPartyReference extends Serializable {
>>>   // safe to ship to workers: holds no state, delegates to the singleton
>>>   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
>>> }
>>>
>>> Also found this SO post:
>>> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>>>
>>>
>>> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg <dgoldenberg...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am seeing a lot of posts on singletons vs. broadcast variables, such
>>>> as:
>>>>
>>>> * http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
>>>> * http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219
>>>>
>>>> What's the best approach to instantiating an object once and having it
>>>> reused by the worker(s)?
>>>>
>>>> E.g. I have an object that loads some static state (such as a
>>>> dictionary/map), is part of a 3rd party API, and is not serializable. I
>>>> can't seem to get it to be a singleton on the worker side: the JVM
>>>> appears to be wiped on every request, so I get a new instance each time
>>>> and the singleton doesn't stick.
>>>>
>>>> Is there an approach where I could make this object, or a wrapper of
>>>> it, a broadcast var? Can Kryo get me there, and would that basically
>>>> mean writing a custom serializer? The 3rd party object may have a bunch
>>>> of member vars hanging off it, so serializing it properly may be
>>>> non-trivial...
>>>>
>>>> Any pointers/hints greatly appreciated.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>
