Currently, PySpark cannot pickle instances of a class defined in the
current script ('__main__'). The workaround is to put the class
implementation in a separate module and ship it with "bin/spark-submit
--py-files xxx.py" when deploying.

In xxx.py:

    class test(object):
        def __init__(self, a, b):
            self.total = a + b
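
In job.py:

    from pyspark import SparkContext
    from xxx import test

    # The shell pre-creates sc; a submitted script has to create it itself.
    sc = SparkContext(appName="job")
    a = sc.parallelize([(True, False), (False, False)])
    a.map(lambda xy: test(*xy))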

Run it with:

    bin/spark-submit --py-files xxx.py job.py
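
To illustrate why the separate module helps, here is a minimal sketch using
only the standard pickle module in Python 3 as a stand-in for PySpark's
serializer (an assumption; PySpark's own code path differs in detail). The
module name "xxx_demo" is hypothetical and is built at runtime so the
snippet is self-contained. It shows that pickle stores a class by name
(module plus class name) rather than by value, so the receiving side must be
able to look the class up under that same name.

```python
import pickle
import sys
import types

# Build a throwaway module at runtime to stand in for xxx.py.
# "xxx_demo" is a hypothetical name used only for this illustration.
mod = types.ModuleType("xxx_demo")
exec(
    "class test(object):\n"
    "    def __init__(self, a, b):\n"
    "        self.total = a + b\n",
    mod.__dict__,
)
sys.modules["xxx_demo"] = mod

obj = mod.test(True, False)
payload = pickle.dumps(obj)

# The payload carries the module and class names plus the instance state,
# not the class body itself.
refers_by_name = b"xxx_demo" in payload and b"test" in payload

# The round trip works while the class is reachable under that name.
restored = pickle.loads(payload)

# Simulate a process in which the name no longer resolves to the class,
# which is roughly the situation for a class that lives only in the
# driver's '__main__'.
del mod.test
try:
    pickle.loads(payload)
    lookup_failed = False
except (AttributeError, pickle.UnpicklingError):
    lookup_failed = True
```

Shipping xxx.py with --py-files makes the module importable on the workers,
so the by-name lookup can succeed there as well.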
On Wed, Feb 18, 2015 at 1:48 PM, Guillaume Guy wrote:
> Hi,
>
> This is a duplicate of the stack-overflow question here. I hope to generate
> more interest on this mailing list.
>
>
> The problem:
>
> I am running into some attribute lookup problems when trying to instantiate a
> class within my RDD.
>
> My workflow is quite standard:
>
> 1- Start with an RDD
>
> 2- Take each element of the RDD and instantiate an object for each
>
> 3- Reduce (I will write a method that will define the reduce operation later
> on)
>
> Here is #2:
>
> class test(object):
>     def __init__(self, a, b):
>         self.total = a + b
>
> a = sc.parallelize([(True,False),(False,False)])
> a.map(lambda (x,y): test(x,y))
>
> Here is the error I get:
>
> PicklingError: Can't pickle <class '__main__.test'>: attribute lookup
> __main__.test failed
>
> I'd like to know if there is any way around it. Please answer with a
> working example that achieves the intended result (i.e. creating an RDD of
> objects of class "test").
>
> Thanks in advance!
>
> Related question:
>
> https://groups.google.com/forum/#!topic/edx-code/9xzRJFyQwn
>
>
> GG
>