How? An example, please. Also, if I am running this in the pyspark shell, how do I configure spark.akka.frameSize?
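[For reference: a minimal sketch of one way to raise the limit, assuming Spark 1.0.x as in the traceback below. spark.akka.frameSize is given in MB, so "100" raises the cap from the default 10 MB to 100 MB. In the interactive pyspark shell the SparkContext is already created for you, so the key would presumably have to go into conf/spark-defaults.conf (or be passed at launch) rather than being set in code. The names below are illustrative only.]

from pyspark import SparkConf, SparkContext

# Hypothetical standalone-script setup; in the pyspark shell this is done
# before launch instead (e.g. via conf/spark-defaults.conf).
conf = (SparkConf()
        .setAppName("frame-size-example")
        .set("spark.akka.frameSize", "100"))   # value is in MB
sc = SparkContext(conf=conf)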
On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> When the data size is huge, you are better off using the TorrentBroadcastFactory.
>
> Thanks
> Best Regards
>
> On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu <chengi.liu...@gmail.com> wrote:
>
>> Specifically, the error I see when I try to operate on an rdd created by the
>> sc.parallelize method is:
>> org.apache.spark.SparkException: Job aborted due to stage failure:
>> Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize
>> (10485760 bytes). Consider using broadcast variables for large values.
>>
>> On Sun, Sep 14, 2014 at 2:20 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
>>
>>> Hi,
>>> I am trying to create an rdd out of a large matrix... sc.parallelize
>>> suggests using broadcast.
>>> But when I do
>>>
>>> sc.broadcast(data)
>>>
>>> I get this error:
>>>
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370, in broadcast
>>>     pickled = pickleSer.dumps(value)
>>>   File "/usr/common/usg/spark/1.0.2/python/pyspark/serializers.py", line 279, in dumps
>>>     def dumps(self, obj): return cPickle.dumps(obj, 2)
>>> SystemError: error return without exception set
>>> Help?
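[A rough sketch of the pattern the replies point at, not a verified fix for the cPickle error above: switch the broadcast implementation to TorrentBroadcastFactory (the HTTP factory is the 1.0.x default), broadcast the large matrix once, and keep the data serialized into each task small. The matrix here is a tiny placeholder.]

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("broadcast-example")
        .set("spark.broadcast.factory",
             "org.apache.spark.broadcast.TorrentBroadcastFactory")
        .set("spark.akka.frameSize", "100"))   # MB
sc = SparkContext(conf=conf)

# Placeholder for the "large matrix" from the thread.
data = [[float(i + j) for j in range(100)] for i in range(100)]

bdata = sc.broadcast(data)            # shipped to executors once, outside the task payload
rdd = sc.parallelize(range(100))      # the parallelized collection itself stays small
row_sums = rdd.map(lambda i: sum(bdata.value[i]))
print(row_sums.take(5))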