Re: zip two RDD in pyspark

2014-07-29 Thread Nick Pentreath
parallelize uses the default serializer (PickleSerializer), while textFile uses UTF8Serializer. You can get around this with index.zip(input_data._reserialize()) or index.zip(input_data.map(lambda x: x)). But if you just do that, you run into the issue of the two RDDs having different numbers of partitions.
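
A minimal sketch of that workaround, assuming an existing SparkContext named sc; the S3 path and partition counts are illustrative, not the original poster's:

    # textFile() deserializes with UTF8Serializer, while parallelize() uses
    # the default PickleSerializer, so zip() on the raw RDDs fails.
    input_data = sc.textFile('s3n://mybucket/myinput', minPartitions=6).cache()

    N = input_data.count()
    index = sc.parallelize(range(N), 6)

    # An identity map() pushes input_data through the default serializer,
    # which has the same effect as the internal _reserialize() mentioned above.
    pairs = index.zip(input_data.map(lambda x: x))

    # Caveat noted above: zip() also requires both RDDs to have the same
    # number of partitions and the same number of elements per partition,
    # which textFile() does not guarantee, so this can still fail.
    print(pairs.take(5))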

Re: zip two RDD in pyspark

2014-07-29 Thread Davies Liu
On Mon, Jul 28, 2014 at 12:58 PM, l wrote:
> I have a file in S3 whose lines I want to map to an index. Here is my
> code:
> input_data = sc.textFile('s3n://myinput', minPartitions=6).cache()
> N = input_data.count()
> index = sc.parallelize(range(N), 6)
> index.zip(input_data)