zip in pyspark truncates RDD to number of processors

2014-06-21 Thread madeleine
Consider the following simple zip:

    n = 6
    a = sc.parallelize(range(n))
    b = sc.parallelize(range(n)).map(lambda j: j)
    c = a.zip(b)
    print a.count(), b.count(), c.count()
    >> 6 6 4

By varying n, I find that c.count() is always min(n, 4), where 4 happens to be the number of threads on my computer. By …
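The reported truncation is consistent with how RDD.zip works: it pairs the two RDDs partition by partition, and within each partition pair the matching stops at the shorter side, like Python's built-in zip on unequal-length lists. A minimal plain-Python sketch of that mechanism (the partition layouts below are illustrative assumptions, not what Spark actually produced on the poster's machine):

```python
from itertools import chain

# Two 6-element datasets, assumed to be split differently across 4
# partitions (illustrative layouts, not Spark's actual ones):
a_parts = [[0, 1], [2, 3], [4], [5]]
b_parts = [[0], [1], [2, 3], [4, 5]]

# RDD.zip pairs corresponding partitions; within each pair, zip stops
# at the shorter side and silently drops the surplus elements.
zipped = list(chain.from_iterable(
    zip(pa, pb) for pa, pb in zip(a_parts, b_parts)))
print(len(zipped))  # 4, even though both datasets hold 6 elements each
```

A common workaround when the two RDDs are not guaranteed to have identical partitioning is to key each RDD with zipWithIndex and join on the index, which does not depend on partition alignment.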

Re: pyspark serializer can't handle functions?

2014-06-16 Thread madeleine
> … functions (see python/pyspark/cloudpickle.py). However I'm also curious, why do you need an RDD of functions?
>
> Matei
>
> On Jun 15, 2014, at 4:49 PM, madeleine wrote:
> > It …

pyspark serializer can't handle functions?

2014-06-15 Thread madeleine
It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a sample …
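The failure can be reproduced without Spark at all: the standard pickle module (which pyspark's default serializer builds on) serializes functions by reference to a module-level name, so lambdas and nested functions can't be pickled. A minimal sketch, using a hypothetical try_pickle helper:

```python
import pickle

def try_pickle(obj):
    """Hypothetical helper: report whether obj survives plain pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

fns = [lambda x: x + 1, lambda x: x * 2]
print(try_pickle(fns))     # False: lambdas have no module-level name
print(try_pickle([1, 2]))  # True: ordinary data pickles fine
```

This is why the thread points at cloudpickle (bundled with pyspark as python/pyspark/cloudpickle.py), which serializes the function's code object and closure by value instead of by reference.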