Consider the following simple zip:
n = 6
a = sc.parallelize(range(n))
b = sc.parallelize(range(n)).map(lambda j: j)  # identity map; should not change the contents
c = a.zip(b)
print(a.count(), b.count(), c.count())
>> 6 6 4
By varying n, I find that c.count() is always min(n, 4), where 4 happens to
be the number of threads on my computer.
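The truncation is consistent with zip pairing elements partition by partition: zip stops at the shorter side within each partition, so mismatched partition layouts silently drop pairs. This is only a plain-Python sketch of those semantics, not a claim about what pyspark does internally, and the partition layouts below are invented for illustration:

```python
# Sketch of per-partition zip semantics (no Spark needed). If the two
# sides place different numbers of elements in corresponding partitions,
# zip() truncates within each partition and pairs vanish silently.

def partition(seq, num_parts):
    """Split seq into num_parts contiguous chunks (sizes may differ by 1)."""
    k, m = divmod(len(seq), num_parts)
    out, i = [], 0
    for p in range(num_parts):
        size = k + (1 if p < m else 0)
        out.append(seq[i:i + size])
        i += size
    return out

def zip_partitions(parts_a, parts_b):
    """Zip partition-by-partition, the way a distributed zip pairs splits."""
    return [list(zip(pa, pb)) for pa, pb in zip(parts_a, parts_b)]

n = 6
a = partition(list(range(n)), 4)         # [[0, 1], [2, 3], [4], [5]]
b = partition(list(range(n)), 3) + [[]]  # mismatched layout: [[0, 1], [2, 3], [4, 5], []]
c = zip_partitions(a, b)
print(sum(len(p) for p in c))  # -> 5, not 6: one pair was dropped
```

With identical layouts on both sides the count comes out to n, which is why a zip of two RDDs built the same way is usually safe only when their partitioning matches exactly.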
> […]ions (see python/pyspark/cloudpickle.py). However I’m also curious,
> why do you need an RDD of functions?
>
> Matei
>
> On Jun 15, 2014, at 4:49 PM, madeleine <[hidden email]> wrote:
>
> > It seems that the default serializer used by pyspark can't serialize a
> > list of functions. I've seen some posts about trying to fix this by
> > using dill to serialize rather than pickle. Does anyone know what the
> > status of that project is, or whether there's another easy workaround?
> > I've pasted a sample […]
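The limitation described above can be reproduced without Spark. The stdlib pickler serializes a module-level function by reference (module plus name), so a list of named functions round-trips fine, while a lambda has no importable name and is rejected. This is a minimal stdlib-only sketch, not pyspark's actual serializer path (which, as noted above, goes through its CloudPickle fork):

```python
import pickle

def square(x):
    """A named, module-level function: picklable by reference."""
    return x * x

# A list of named functions round-trips with the default pickler.
restored = pickle.loads(pickle.dumps([square]))
print(restored[0](3))  # -> 9

# A lambda has no importable name, so the default pickler rejects it.
try:
    pickle.dumps([lambda x: x + 1])
    lambda_failed = False
except Exception:
    lambda_failed = True
print("stdlib pickle handled the lambda?", not lambda_failed)
```

Serializers such as cloudpickle (and dill) instead capture the function's code object and closure, which is why they can ship lambdas where plain pickle cannot.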