Hi Gerard,

thanks for the hint about the singleton object. It looks very interesting. However, if my singleton object (e.g. a handle to my DB) has a member variable that is non-serializable, I will run into the same problem again, won’t I? At least, I always hit the issue that Python tries to pickle it and can’t.
I am doing something like this:

    count = sc.parallelize(xrange(0, 100)).map(SingletonMapper()).collect()

with the mapper defined like this:

    class SingletonMapper(object):

        # Inner class that actually holds the DB handle
        class __SingletonMapper:
            db = None

            def __init__(self):
                self.db = ConnectDB()

            def __call__(self, x):
                return self.db.get(x)

        # The single shared instance of the inner class
        instance = None

        def __new__(cls):
            if not SingletonMapper.instance:
                SingletonMapper.instance = SingletonMapper.__SingletonMapper()
            return SingletonMapper.instance

        def __getattr__(self, name):
            return getattr(self.instance, name)

        def __setattr__(self, name, value):
            return setattr(self.instance, name, value)

Best,
Tassilo

On Fri, Apr 10, 2015 at 5:55 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi Tassilo,
>
> Singleton objects are ideal for this purpose. They are initialized once
> per JVM and will keep whatever data local to the node they reside on.
>
> What you need to keep in mind is the interaction of Spark with that
> singleton object:
> - Prefer the usage of [map|foreach]Partition to interact with your object,
> especially when it involves costly resources (db connections, sockets, ...)
> - Keep your operations idempotent, it is easy to forget that Spark might
> recompute the complete lineage on each action and that will trigger
> whatever interaction with your singleton object more than once.
> - strategically placed 'rdd.cache' will help you break the lineage where
> needed, helping control the effects of the previous point.
>
> -kr, Gerard.
>
>
> On Fri, Apr 10, 2015 at 6:41 AM, TJ Klein <tjkl...@gmail.com> wrote:
>
>> Hi,
>>
>> I wonder if there is any possibility to keep a local variable on each
>> computing node (apart from the RDD)?
>> In my scenario I'd like to keep a structure on each node that cannot be
>> serialized and should be initialized the first time and then used in all
>> iterations to come.
>>
>> Best,
>> Tassilo
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Keep-local-variable-tp22451.html
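
For reference, here is a minimal sketch of one way the pickling issue might be avoided. Instead of passing an object that holds the handle to map(), the handle is kept in a module-level variable and created lazily inside the function handed to mapPartitions, so only the plain function is shipped and the handle itself is never pickled. FakeDB, get_db, lookup_partition and is_picklable are hypothetical names made up for illustration (FakeDB stands in for whatever ConnectDB() returns). The sketch also assumes the code lives in a module that is importable on the workers, so that the module-level variable survives across tasks in the same Python worker process.

```python
import pickle
import threading


class FakeDB(object):
    """Hypothetical stand-in for a DB handle that cannot be pickled."""

    def __init__(self):
        # A lock is a typical non-picklable member of a real connection object.
        self._lock = threading.Lock()

    def get(self, x):
        # Placeholder lookup: a real handle would query the database here.
        return x * 2


# Module-level slot for the handle. It starts empty and is filled on first use
# in whichever process calls get_db().
_db = None


def get_db():
    """Create the handle on first call, then keep returning the same one."""
    global _db
    if _db is None:
        _db = FakeDB()
    return _db


def lookup_partition(rows):
    """Intended for mapPartitions: fetch the handle once per partition."""
    db = get_db()
    for x in rows:
        yield db.get(x)


def is_picklable(obj):
    """Report whether the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    # Local check without Spark: the generator works on any iterable.
    print(list(lookup_partition(range(5))))
    # The handle itself is what fails to pickle.
    print(is_picklable(FakeDB()))
```

With Spark, the call from my mail would then presumably become sc.parallelize(xrange(0, 100)).mapPartitions(lookup_partition).collect(). As far as I understand, in PySpark the handle would then exist once per Python worker process rather than once per JVM.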