Hi Gerard,

Thanks for the hint about the singleton object. It seems very interesting.
However, if my singleton object (e.g. the handle to my DB) has a member
variable that is non-serializable, I will run into the same problem again,
won't I? At least, I always hit the issue that Python tries to pickle it
and can't do so.

I am doing something like this:

    count = sc.parallelize(xrange(0, 100)).map(SingletonMapper()).collect()

where SingletonMapper is defined as:

    class SingletonMapper(object):
        class __SingletonMapper:
            db = None

            def __init__(self):
                self.db = ConnectDB()

            def __call__(self, x):
                return self.db.get(x)

        instance = None

        def __new__(cls):
            if not SingletonMapper.instance:
                SingletonMapper.instance = SingletonMapper.__SingletonMapper()
            return SingletonMapper.instance

        def __getattr__(self, name):
            return getattr(self.instance, name)

        def __setattr__(self, name, value):
            return setattr(self.instance, name, value)
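
For comparison, here is a minimal sketch of a variant that keeps the handle out of anything that gets pickled. The handle is created lazily inside the worker process and only a plain function is passed to Spark. FakeDB, get_db and lookup_partition are hypothetical names; FakeDB merely stands in for the real ConnectDB() handle. The sketch also assumes the code lives in a module that the workers can import, so that the module-level variable survives between tasks.

```python
import pickle
import threading


class FakeDB(object):
    """Hypothetical stand-in for the real DB handle.

    It holds a lock, which makes instances unpicklable, similar to a
    real connection object.
    """

    def __init__(self):
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return key * 2


# One handle per Python process. It starts out as None, so nothing
# unpicklable exists at the time the function is shipped.
_db = None


def get_db():
    """Create the handle on first use, in the process that calls it."""
    global _db
    if _db is None:
        _db = FakeDB()
    return _db


def lookup_partition(keys):
    """Meant for rdd.mapPartitions: fetch the handle once per partition."""
    db = get_db()
    for key in keys:
        yield db.get(key)


# Local check without a Spark cluster: the function is an ordinary generator.
results = list(lookup_partition(iter(range(5))))

# The handle itself cannot be pickled, which is why it must never be part
# of the object handed to map().
try:
    pickle.dumps(FakeDB())
    handle_is_picklable = True
except (TypeError, pickle.PicklingError):
    handle_is_picklable = False
```

With Spark this would be used roughly as sc.parallelize(range(100)).mapPartitions(lookup_partition).collect(). I have not verified how this behaves when everything is defined in the driver script itself rather than in an importable module.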

Best,
Tassilo

On Fri, Apr 10, 2015 at 5:55 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi Tassilo,
>
> Singleton objects are ideal for this purpose. They are initialized once
> per JVM and will keep whatever data they hold local to the node they
> reside on.
>
> What you need to keep in mind is the interaction of Spark with that
> singleton object:
> - Prefer mapPartitions or foreachPartition to interact with your object,
> especially when it involves costly resources (DB connections, sockets, ...).
> - Keep your operations idempotent. It is easy to forget that Spark might
> recompute the complete lineage on each action, which will trigger
> whatever interaction you have with your singleton object more than once.
> - A strategically placed 'rdd.cache' will help you avoid recomputing the
> lineage where needed, helping control the effects of the previous point.
>
> -kr, Gerard.
>
>
> On Fri, Apr 10, 2015 at 6:41 AM, TJ Klein <tjkl...@gmail.com> wrote:
>
>> Hi,
>>
>> I wonder if there is any possibility to keep a local variable on each
>> computing node (apart from the RDD)?
>> In my scenario I'd like to keep a structure on each node that cannot be
>> serialized and should be initialized the first time and then used in all
>> iterations to come.
>>
>> Best,
>>  Tassilo
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Keep-local-variable-tp22451.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
