zip in pyspark truncates RDD to number of processors

2014-06-21 Thread madeleine
Consider the following simple zip:

    n = 6
    a = sc.parallelize(range(n))
    b = sc.parallelize(range(n)).map(lambda j: j)
    c = a.zip(b)
    print a.count(), b.count(), c.count()
    >> 6 6 4

By varying n, I find that c.count() is always min(n, 4), where 4 happens to be the number of threads on my computer. By …
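The reported truncation is consistent with how RDD.zip works: it pairs the two RDDs partition by partition, and within each partition pair the matching stops at the shorter side, like Python's built-in zip on unequal-length lists. A minimal plain-Python sketch of that mechanism (the partition layouts below are illustrative assumptions, not what Spark actually produced on the poster's machine):

```python
from itertools import chain

# Two 6-element datasets, assumed to be split differently across 4
# partitions (illustrative layouts, not Spark's actual ones):
a_parts = [[0, 1], [2, 3], [4], [5]]
b_parts = [[0], [1], [2, 3], [4, 5]]

# RDD.zip pairs corresponding partitions; within each pair, zip stops
# at the shorter side and silently drops the surplus elements.
zipped = list(chain.from_iterable(
    zip(pa, pb) for pa, pb in zip(a_parts, b_parts)))
print(len(zipped))  # 4, even though both datasets hold 6 elements each
```

A common workaround when the two RDDs are not guaranteed to have identical partitioning is to key each RDD with zipWithIndex and join on the index, which does not depend on partition alignment.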

Re: pyspark serializer can't handle functions?

2014-06-16 Thread madeleine
> … functions (see python/pyspark/cloudpickle.py). However I'm also curious, why do you need an RDD of functions?
>
> Matei
>
> On Jun 15, 2014, at 4:49 PM, madeleine wrote:
> > It …

pyspark serializer can't handle functions?

2014-06-15 Thread madeleine
It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a sample …
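The failure can be reproduced without Spark at all: the standard pickle module (which pyspark's default serializer builds on) serializes functions by reference to a module-level name, so lambdas and nested functions can't be pickled. A minimal sketch, using a hypothetical try_pickle helper:

```python
import pickle

def try_pickle(obj):
    """Hypothetical helper: report whether obj survives plain pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

fns = [lambda x: x + 1, lambda x: x * 2]
print(try_pickle(fns))     # False: lambdas have no module-level name
print(try_pickle([1, 2]))  # True: ordinary data pickles fine
```

This is why the thread points at cloudpickle (bundled with pyspark as python/pyspark/cloudpickle.py), which serializes the function's code object and closure by value instead of by reference.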