Would this work for you?

def processRDD(rdd):
    analyzer = ShortTextAnalyzer(root_dir)
    rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

ssc.union(*streams).filter(lambda x: x[1] != None)
.foreachRDD(lambda rdd: processRDD(rdd))



Regards
Sumit Chawla


On Wed, Dec 28, 2016 at 7:57 AM, Sidney Feiner <sidney.fei...@startapp.com>
wrote:

> Hey,
>
> I just posted this question on Stack Overflow (link here
> <http://stackoverflow.com/questions/41362314/pyspark-streaming-job-avoid-object-serialization>)
> and decided to try my luck here as well J
>
>
>
> I'm writing a PySpark job but I got into some performance issues.
> Basically, all it does is read events from Kafka and logs the
> transformations made. Thing is, the transformation is calculated based on
> an object's function, and that object is pretty heavy as it contains a
> Graph and an inner-cache which gets automatically updated as it processes
> rdd's. So when I write the following piece of code:
>
> analyzer = ShortTextAnalyzer(root_dir)
>
> logger.info("Start analyzing the documents from kafka")
>
> ssc.union(*streams).filter(lambda x: x[1] != None).foreachRDD(lambda rdd: 
> rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1])))
>
>
>
> It serializes my analyzer which takes a lot of time because of the graph,
> and as it is copied to the executor, the cache is only relevant for that
> specific RDD.
>
> If the job was written in Scala, I could have written an Object which
> would exist in every executor and then my object wouldn't have to be
> serialized each time.
>
> I've read in a post (http://www.spark.tc/deserialization-in-pyspark-
> storage/) that prior to PySpark 2.0, objects are always serialized. So
> does that mean that I have no way to avoid the serialization?
>
> I'd love to hear about a way to avoid serialization in PySpark if it
> exists. To have my object created once for each executor and then it could
> avoid the serialization process, gain time and actually have a working
> cache system?
>
> Thanks in advance :)
>
> *Sidney Feiner*   */*  SW Developer
>
> M: +972.528197720 <+972%2052-819-7720>  */*  Skype: sidney.feiner.startapp
>
>
>
> [image: StartApp] <http://www.startapp.com/>
>
>
>

Reply via email to