Would this work for you?

    def processRDD(rdd):
        analyzer = ShortTextAnalyzer(root_dir)
        rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

    ssc.union(*streams).filter(lambda x: x[1] != None).foreachRDD(lambda rdd: processRDD(rdd))

Regards
Sumit Chawla

On Wed, Dec 28, 2016 at 7:57 AM, Sidney Feiner <sidney.fei...@startapp.com> wrote:

> Hey,
>
> I just posted this question on Stack Overflow (link here
> <http://stackoverflow.com/questions/41362314/pyspark-streaming-job-avoid-object-serialization>)
> and decided to try my luck here as well :)
>
> I'm writing a PySpark job but I got into some performance issues.
> Basically, all it does is read events from Kafka and logs the
> transformations made. Thing is, the transformation is calculated based on
> an object's function, and that object is pretty heavy as it contains a
> Graph and an inner-cache which gets automatically updated as it processes
> rdd's. So when I write the following piece of code:
>
>     analyzer = ShortTextAnalyzer(root_dir)
>
>     logger.info("Start analyzing the documents from kafka")
>
>     ssc.union(*streams).filter(lambda x: x[1] != None).foreachRDD(lambda rdd: rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1])))
>
> It serializes my analyzer which takes a lot of time because of the graph,
> and as it is copied to the executor, the cache is only relevant for that
> specific RDD.
>
> If the job was written in Scala, I could have written an Object which
> would exist in every executor and then my object wouldn't have to be
> serialized each time.
>
> I've read in a post
> (http://www.spark.tc/deserialization-in-pyspark-storage/) that prior to
> PySpark 2.0, objects are always serialized. So does that mean that I have
> no way to avoid the serialization?
>
> I'd love to hear about a way to avoid serialization in PySpark if it
> exists. To have my object created once for each executor and then it could
> avoid the serialization process, gain time and actually have a working
> cache system?
>
> Thanks in advance :)
>
> Sidney Feiner / SW Developer
> M: +972.528197720 / Skype: sidney.feiner.startapp
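
A follow-up note on the "created once for each executor" part of the question quoted above. The function passed to foreachRDD runs on the driver, so an analyzer built there is still captured by the inner lambda and pickled with it. One pattern that may help is to build the object lazily inside the Python worker process and reach it through foreachPartition. The following is only a sketch under stated assumptions: the ShortTextAnalyzer class here is a small stand-in for the real one, ROOT_DIR is a hypothetical path, and get_analyzer and process_partition are illustrative names. It also assumes the helper lives in its own module shipped to the executors (for example with --py-files) and that Python worker reuse (spark.python.worker.reuse, true by default) is left on.

```python
# Sketch only. ShortTextAnalyzer is a stand-in for the real, heavy class and
# ROOT_DIR is a hypothetical path; replace both with the real ones.
ROOT_DIR = "/path/to/root_dir"

_analyzer = None  # at most one instance per Python worker process


class ShortTextAnalyzer:
    """Stand-in that counts constructions so reuse can be observed."""
    instances = 0

    def __init__(self, root_dir):
        ShortTextAnalyzer.instances += 1
        self.root_dir = root_dir
        self.cache = {}

    def analyze_short_text_event(self, text):
        # Toy "cache": remember how often each text was seen.
        self.cache[text] = self.cache.get(text, 0) + 1
        return self.cache[text]


def get_analyzer(root_dir=ROOT_DIR):
    """Build the analyzer on first use in this process, then reuse it."""
    global _analyzer
    if _analyzer is None:
        _analyzer = ShortTextAnalyzer(root_dir)
    return _analyzer


def process_partition(records):
    """Meant to run on the executor; no analyzer object is in the closure."""
    analyzer = get_analyzer()
    for record in records:
        analyzer.analyze_short_text_event(record[1])


# Wiring into the job from the original message (not executed here):
# ssc.union(*streams) \
#     .filter(lambda x: x[1] is not None) \
#     .foreachRDD(lambda rdd: rdd.foreachPartition(process_partition))
```

With this shape only a reference to process_partition travels with each task, and the analyzer and its cache should survive across batches for as long as the worker process is reused. Note that this gives one analyzer per Python worker process, which is not necessarily one per executor, so each cache sees only the records routed to its own process.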